Preface to the Classic Edition


This is the first of four books I have written; the one I worked the hardest on; 
and the one I am fondest of. It marked my goodbye to mathematics and 
probability theory. About the time the book was written, I left UCLA to go into 
the world of applied statistics and computing as a full-time freelance 
consultant. 

The book went out of print well over ten years ago, but before it did a 
generation of statisticians, engineers, and mathematicians learned graduate 
probability theory from its pages. Since the book became unavailable, I have 
received many calls asking where it could be bought and then for permission to 
copy part or all of it for use in graduate probability courses. 

These reminders that the book was not forgotten saddened me and I was 
delighted when SIAM offered to republish it in their Classics Series. The 
present edition is the same as the original except for the correction of a few 
misprints and errors, mainly minor. 

After the book was out for a few years it became commonplace for a 
younger participant at some professional meeting to lean over toward me and 
confide that he or she had studied probability out of my book. Lately, this has 
become rarer and the confiders older. With republication, I hope that the age 
and frequency trends will reverse direction. 


Leo Breiman 


University of California, Berkeley 
Preface


A few years ago I started a book by first writing a very extensive preface. I 
never finished that book and resolved that in the future I would write first the 
book and then the preface. Having followed this resolution I note that the result 
is a desire to be as brief as possible. 

This text developed from an introductory graduate course and seminar in 
probability theory at UCLA. A prerequisite is some knowledge of real variable 
theory, such as the ideas of measure, measurable functions, and so on. 
Roughly, the first seven chapters of Measure Theory by Paul Halmos [64] is 
sufficient background. There is an appendix which lists the essential 
definitions and theorems. This should be taken as a rapid review or outline for 
study rather than as an exposition. No prior knowledge of probability is 
assumed, but browsing through an elementary book such as the one by William 
Feller [59, Vol. I], with its diverse and vivid examples, gives an excellent
feeling for the subject. 

Probability theory has a right and a left hand. On the right is the rigorous 
foundational work using the tools of measure theory. The left hand “thinks 
probabilistically,” reduces problems to gambling situations, coin-tossing, 
motions of a physical particle. I am grateful to Michel Loève for teaching me
the first side, and to David Blackwell, who gave me the flavor of the other. 

David Freedman read through the entire manuscript. His suggestions 
resulted in many substantial revisions, and the book has been considerably 
improved by his efforts. Charles Stone worked hard to convince me of the 
importance of analytic methods in probability. The presence of Chapter 10 is 
largely due to his influence, and I am further in his debt for reading parts of the 
manuscript and for some illuminating conversations on diffusion theory. 

Of course, in preparing my lectures, I borrowed heavily from the existing 
books in the field and the finished product reflects this. In particular, the books 
by M. Loève [108], J. L. Doob [39], E. B. Dynkin [43], and K. Ito and H. P.
McKean [76] were significant contributors. 

Two students, Carl Maltz and Frank Kontrovich, read parts of the 
manuscript and provided lists of mistakes and unreadable portions. Also, I was 
blessed by having two fine typists, Louise Gaines and Ruth Goldstein, who 
rose above mere patience when faced with my numerous revisions of the “final 
draft.” Finally, I am grateful to my many nonmathematician friends who 
continually asked when I was going to finish “that thing,” in voices that could 
not be interminably denied. 

Leo Breiman 
Topanga, California 
January, 1968 
CHAPTER 1


INTRODUCTION 


A good deal of probability theory consists of the study of limit theorems. 
These limit theorems come in two categories which we call strong and weak. 
To illustrate and also to dip into history we begin with a study of coin- 
tossing and a discussion of the two most famous prototypes of weak and 
strong limit theorems. 


1. n INDEPENDENT TOSSES OF A FAIR COIN 


These words put us immediately into difficulty. What meaning can be 
assigned to the words, coin, fair, independent? Take a pragmatic attitude—all 
computations involving n tosses of a fair coin are based on two givens: 


a) There are $2^n$ possible outcomes, namely, all sequences n-long of the two
letters H and T (Heads and Tails).


b) Each sequence has probability $2^{-n}$.


Nothing else is given. All computations regarding odds, and so forth, in 
fair coin-tossing are based on (a) and (b) above. Hence we take (a) and (b) 
as being the complete definition of n independent tosses of a fair coin. 


2. THE “LAW OF AVERAGES” 


Vaguely, almost everyone believes that for large n, the number of heads is 
about the same as the number of tails. That is, if you toss a fair coin a large 
number of times, then about half the tosses result in heads. 

How to make this mathematics? All we have at our disposal to mathematize
the “law of averages” are (a) and (b) above. So if there is anything
at all corresponding to the law of averages, it must come out of (a) and (b)
with no extra added ingredients.

Analyze the $2^n$ sequences of H and T. In how many of these sequences
do exactly k heads appear? This is a combinatorial problem which clearly
can be rephrased as: Given n squares, in how many different ways can we
distribute k crosses on them? (See Fig. 1.1.) For example, if n = 3, k = 2,
then we have the result shown in Fig. 1.2, and the answer is 3.

[Figure 1.1: k crosses distributed on n squares. Figure 1.2: the case n = 3, k = 2.]

To get the answer in general, take the k crosses and subscript them so
they become different from each other, that is, $+_1, +_2, \ldots, +_k$. Now we
may place these latter crosses in n squares in $n(n-1)\cdots(n-k+1)$ ways
[$+_1$ may be put down in n ways, then $+_2$ in (n - 1) ways, and so forth].
But any permutation of the k subscripted crosses among the boxes they
occupy gives rise to exactly the same distribution of unsubscripted crosses.
There are k! permutations. Hence


Proposition 1.1. There are exactly

    ${}_nC_k = \dfrac{n!}{k!\,(n-k)!}$

sequences of H, T, n-long in which k heads appear.


Simple computations show that if n is even, ${}_nC_k$ is a maximum for
k = n/2 and if n is odd, ${}_nC_k$ has its maximum value at k = (n - 1)/2 and
k = (n + 1)/2.
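
Proposition 1.1 can be checked by brute force for small n. The following Python sketch is an illustration only and not part of the original argument; the function name `count_sequences_with_k_heads` is our own, and the enumeration visits all $2^n$ sequences, so it is practical only for small n.

```python
from itertools import product
from math import comb


def count_sequences_with_k_heads(n, k):
    """Count, by direct enumeration, the n-long H/T sequences with exactly k heads."""
    return sum(1 for seq in product("HT", repeat=n) if seq.count("H") == k)


if __name__ == "__main__":
    n = 6
    for k in range(n + 1):
        # Left: brute-force count. Right: n! / (k! (n-k)!) from Proposition 1.1.
        print(k, count_sequences_with_k_heads(n, k), comb(n, k))
```

For n = 3, k = 2 the enumeration finds the three arrangements of Fig. 1.2.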


Stirling’s Approximation [59, Vol. I, pp. 50 ff.]
(1.2)    $n! = e^{-n} n^n \sqrt{2\pi n}\,(1 + \epsilon_n)$,
where $\epsilon_n \to 0$ as $n \to \infty$.

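We use this to get

(1.3)    ${}_{2n}C_n = \dfrac{(2n)!}{n!\,n!} = \dfrac{e^{-2n}(2n)^{2n}\sqrt{4\pi n}}{e^{-2n} n^{2n}(2\pi n)}\,(1 + \delta_n) = 2^{2n} \cdot \dfrac{1}{\sqrt{\pi n}}\,(1 + \delta_n)$,

where $\delta_n \to 0$ as $n \to \infty$.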

In 2n trials there are $2^{2n}$ possible sequences of outcomes H, T. Thus
(1.3) implies that k = n for only a fraction of about $1/\sqrt{\pi n}$ of the sequences.
Equivalently, the probability that the number of heads equals the number of
tails is about $1/\sqrt{\pi n}$ for n large (see Fig. 1.3).


Conclusion. As n becomes large, the proportion of sequences such that 
heads comes up exactly n/2 times goes to zero (see Fig. 1.3). 
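
A numerical comparison makes the estimate concrete. The sketch below is an illustration, not part of the text's derivation; the function names are ours. It computes the exact probability $2^{-2n}\,{}_{2n}C_n$ of n heads in 2n tosses and sets it beside the approximation $1/\sqrt{\pi n}$ obtained from (1.3).

```python
from math import comb, pi, sqrt


def prob_equal_heads_tails(n):
    """Exact probability of exactly n heads in 2n tosses of a fair coin."""
    return comb(2 * n, n) / 4**n


def stirling_estimate(n):
    """The approximation 1/sqrt(pi*n) from (1.3), dropping the factor (1 + delta_n)."""
    return 1.0 / sqrt(pi * n)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        exact = prob_equal_heads_tails(n)
        approx = stirling_estimate(n)
        # The ratio exact/approx is 1 + delta_n and should approach 1.
        print(n, exact, approx, exact / approx)
```

Both columns shrink toward zero as n grows while their ratio tends to one, which is the content of the Conclusion above.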


Whatever the “law of averages” may say, it is certainly not reasonable
in a thousand tosses of a fair coin to expect exactly 500 heads. It is not
possible to fix a number M such that for n large most of the sequences have
the property that the number of heads in the sequence is within M of n/2.
For 2n tosses this fraction of the sequences is easily seen to be less than
$2M/\sqrt{\pi n}$ (forgetting $\delta_n$) and so becomes smaller and smaller.

[Figure 1.3: Probability of exactly k heads in 2n tosses.]

To be more reasonable, perhaps the best we can get is that usually the
proportion of heads in n tosses is close to 1/2. More precisely:


Question. Given any $\epsilon > 0$, for how many sequences does the proportion
of heads differ from 1/2 by less than $\epsilon$?


The answer to this question is one of the earliest and most famous of the
limit theorems of probability. Let $N(n, \epsilon)$ be the number of sequences
n-long satisfying the condition of the above question.


Theorem 1.4. $\lim_n 2^{-n} N(n, \epsilon) = 1$.


In other words, the fraction of sequences such that the proportion of heads
differs from 1/2 by less than $\epsilon$ goes to one as n increases for any $\epsilon > 0$.
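
The fraction $2^{-n} N(n, \epsilon)$ can be computed exactly from Proposition 1.1. The sketch below is only an illustration; the function name `fraction_near_half` is ours, and $\epsilon$ is passed as a `Fraction` (an assumption of this sketch) so that the strict inequality is tested without rounding error.

```python
from fractions import Fraction
from math import comb


def fraction_near_half(n, eps):
    """Return 2^{-n} N(n, eps): the fraction of n-long sequences whose
    proportion of heads differs from 1/2 by less than eps (a Fraction)."""
    half = Fraction(1, 2)
    count = sum(comb(n, k) for k in range(n + 1)
                if abs(Fraction(k, n) - half) < eps)
    return count / 2**n


if __name__ == "__main__":
    eps = Fraction(1, 20)
    for n in (10, 100, 1000, 5000):
        print(n, fraction_near_half(n, eps))
```

For fixed $\epsilon$ the printed fractions climb toward one, as Theorem 1.4 asserts.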


This theorem is called the weak law of large numbers for fair coin tossing.
To prove this theorem we need to show that

(1.5)    $\lim_n \dfrac{1}{2^n} \sum_{k;\,|k/n - 1/2| < \epsilon} {}_nC_k = 1$.

Theorem 1.4 states that most of the time, if you toss a coin n times, the
proportion of heads will be close to 1/2. Is this what is intuitively meant by
the law of averages? Not quite: the abiding faith seems to be that no matter
how badly you have done on the first n tosses, eventually things will settle down
and smooth out if you keep tossing the coin.

Ignore this faith for the moment. Let us go back and establish some 
notation and machinery so we can give Theorem 1.4 an interesting proof. 
One proof is simply to establish (1.5) by direct computation. It was done this 
way originally, but the following proof is simpler. 
Definition 1.6


a) Let $\Omega_n$ be the space consisting of all sequences n-long of H, T. Denote
these sequences by $\omega_1, \omega_2, \ldots, \omega_N$, $N = 2^n$.


b) Let A, B, C, and so forth, denote subsets of $\Omega_n$. The probability P(A) of
any subset A is defined as the sum of the probabilities of all sequences in A,
that is,

    $P(A) = 2^{-n}$ (number of sequences in A),

equivalently, P(A) is the fraction of the total number of sequences that are
in A.


For example, one interesting subset of $\Omega_n$ is the set $A_1$ of all sequences such
that the first member is H. This set can be described as “the first toss results
in heads.” We should certainly have, if (b) above makes sense, $P(A_1) = 1/2$.
This is so, because there are exactly $2^{n-1}$ members of $\Omega_n$ whose first member is
H.


c) Let $X(\omega)$ be any real-valued function on $\Omega_n$. Define the expected value of
X as

    $EX = \dfrac{1}{2^n} \sum_{\omega \in \Omega_n} X(\omega)$.

Note that the expected value of X is just its average weighted by the
probability. Suppose $X(\omega)$ takes the value $x_1$ on the set of sequences $A_1$, $x_2$ on $A_2$,
and so forth; then, of course,

    $EX = \sum_j x_j P(A_j)$.

And also note that EX is an integral, that is,

    $E(\alpha X + \beta Y) = \alpha EX + \beta EY$,

where $\alpha$, $\beta$ are real numbers, and $EX \ge 0$ for $X \ge 0$. Also, in the future
we will denote by $\{\omega; \cdots\}$ the subset of $\Omega_n$ satisfying the conditions
following the semicolon.

The proof of 1.4 will be based on the important Chebyshev inequality. 


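Proposition 1.7. For $X(\omega)$ any function on $\Omega_n$, and any $\epsilon > 0$,

    $P(\omega; |X(\omega)| \ge \epsilon) \le \dfrac{1}{\epsilon^2} E(X^2)$.


Proof

    $P(\omega; |X| \ge \epsilon) = \dfrac{1}{2^n}\,(\text{number of } \omega; |X| \ge \epsilon) = \sum_{\{\omega;\,|X| \ge \epsilon\}} \dfrac{1}{2^n}$

    $\le \sum_{\{\omega;\,|X| \ge \epsilon\}} \dfrac{X^2(\omega)}{\epsilon^2} \cdot \dfrac{1}{2^n} \le \dfrac{1}{\epsilon^2} \sum_{\omega \in \Omega_n} X^2(\omega) \cdot \dfrac{1}{2^n} = \dfrac{1}{\epsilon^2} EX^2$.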
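Define functions $X_1(\omega), \ldots, X_n(\omega), S_n(\omega)$ on $\Omega_n$ by

(1.8)
    $X_j(\omega) = 1$ if jth member of $\omega$ is H,
    $X_j(\omega) = 0$ if jth member of $\omega$ is T,

    $S_n(\omega) = X_1(\omega) + \cdots + X_n(\omega)$,

so that $S_n(\omega)$ is exactly the number of heads in the sequence $\omega$. For practice,
note that

    $EX_1 = 0 \cdot P(\omega; \text{first toss} = T) + 1 \cdot P(\omega; \text{first toss} = H) = 1/2$,
    $EX_1X_2 = 0 \cdot P(\omega; \text{either first toss or second toss} = T)$
             $+ 1 \cdot P(\omega; \text{both first toss and second toss} = H) = 1/4$

(since there are $2^{n-2}$ sequences beginning with HH). Similarly, check that
if $i \ne j$, then

    $(X_i - \tfrac12)(X_j - \tfrac12) = +\tfrac14$ on $2^{n-1}$ sequences,
    $(X_i - \tfrac12)(X_j - \tfrac12) = -\tfrac14$ on $2^{n-1}$ sequences,

so that

    $E(X_i - \tfrac12)(X_j - \tfrac12) = 0$, $i \ne j$.

Also,

    $(X_i - \tfrac12)^2 = \tfrac14$, so that $E(X_i - \tfrac12)^2 = \tfrac14$.

Finally, write

    $S_n - \dfrac{n}{2} = \sum_{i=1}^{n} (X_i - \tfrac12)$,

so that

(1.9)    $E\left(S_n - \dfrac{n}{2}\right)^2 = \sum_{i,j} E(X_i - \tfrac12)(X_j - \tfrac12) = \sum_{i=1}^{n} E(X_i - \tfrac12)^2 = \dfrac{n}{4}$.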

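Proof of Theorem 1.4. By Chebyshev’s inequality,

    $P\left(\omega; \left|\dfrac{S_n(\omega)}{n} - \dfrac12\right| \ge \epsilon\right) \le \dfrac{1}{\epsilon^2}\, E\left(\dfrac{S_n}{n} - \dfrac12\right)^2$.

Use (1.9) now to get

(1.10)    $P\left(\omega; \left|\dfrac{S_n(\omega)}{n} - \dfrac12\right| \ge \epsilon\right) \le \dfrac{1}{4n\epsilon^2}$,

implying

    $\lim_n P\left(\omega; \left|\dfrac{S_n}{n} - \dfrac12\right| \ge \epsilon\right) = 0$.

Since $P(\Omega_n) = 1$, this completes the proof.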
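
The bound (1.10) is crude but easy to test. The sketch below is an illustration; the function names are ours, and $\epsilon$ is passed as a `Fraction` as an assumption of this sketch, so that everything is computed in exact arithmetic. It compares the exact probability of the event $|S_n/n - 1/2| \ge \epsilon$ with the bound $1/(4n\epsilon^2)$.

```python
from fractions import Fraction
from math import comb


def tail_probability(n, eps):
    """Exact P(|S_n/n - 1/2| >= eps) for n fair tosses; eps is a Fraction."""
    half = Fraction(1, 2)
    count = sum(comb(n, k) for k in range(n + 1)
                if abs(Fraction(k, n) - half) >= eps)
    return Fraction(count, 2**n)


def chebyshev_bound(n, eps):
    """The right-hand side of (1.10)."""
    return 1 / (4 * n * eps**2)


if __name__ == "__main__":
    eps = Fraction(1, 10)
    for n in (10, 100, 1000):
        print(n, float(tail_probability(n, eps)), float(chebyshev_bound(n, eps)))
```

The exact tail is typically far below the bound; all the proof needs is that the bound goes to zero. For n = 2 and $\epsilon = 1/2$ the two sides coincide, so the inequality cannot be improved without further assumptions.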
Definition 1.11. Consider n independent tosses of a biased coin with probability
p of heads. This is defined by


a) there are $2^n$ possible outcomes consisting of all sequences in $\Omega_n$.
b) the probability $P(\omega)$ of any sequence $\omega$ is given by

    $P(\omega) = p^{(\#\text{ of H in } \omega)}\, q^{(\#\text{ of T in } \omega)}$, $q = 1 - p$.

As before, define P(A), $A \subset \Omega_n$, by $P(A) = \sum_{\omega \in A} P(\omega)$. For $X(\omega)$ any
real-valued function on $\Omega_n$, define $EX = \sum_{\omega \in \Omega_n} X(\omega) P(\omega)$.
wD, 


The following problems concern biased coin-tossing. 


Problems
1. Show that $P(\Omega_n) = 1$, $P(A \cup B) \le P(A) + P(B)$ with equality if A and
B are disjoint.
2. Show that
    $P(\omega; S_n(\omega) = k) = {}_nC_k\, p^k q^{n-k}$.
3. Show that Chebyshev’s inequality 1.7 remains true for biased coin-tossing.
4. Prove the weak law of large numbers in the form: for any $\epsilon > 0$,

    $\lim_n P\left(\omega; \left|\dfrac{S_n(\omega)}{n} - p\right| > \epsilon\right) = 0$.

5. Using Stirling’s approximation, find an approximation to the value of
$P(S_{2n} = n)$.


Definition 1.12. For $\omega \in \Omega_n$, $\omega = (\omega_1, \ldots, \omega_n)$, where $\omega_i \in \{H, T\}$, call $\omega_i$
the ith coordinate of $\omega$ or the outcome of the ith toss. Any subset $A \subset \Omega_n$ will
be referred to as an event. An event $A \subset \Omega_n$ will be said to depend only on the
$i_1, \ldots, i_k$th tosses if it is of the form

    $A = \{\omega; (\omega_{i_1}, \ldots, \omega_{i_k}) \in E\}$, $E \subset \Omega_k$, $k \le n$.


Problems

6. If A is of the form above, show that $P(A) = P'(E)$, where $P'(\omega)$ is defined
on $\Omega_k$ by 1.11(b).

7. If $A, B \subset \Omega_n$ are such that A depends on the $i_1, \ldots, i_k$ tosses and B on
the $j_1, \ldots, j_m$ tosses and the sets $(i_1, \ldots, i_k)$, $(j_1, \ldots, j_m)$ have no common
member, then

    $P(A \cap B) = P(A)P(B)$.

[Hint: On Problems 6 and 7 above induction works.]
[Figure 1.4: Probability of k heads in n tosses.]


3. THE BELL-SHAPED CURVE ENTERS (Fluctuation Theory) 


For large n, the weak law of large numbers says that most outcomes have
about n/2 heads, more precisely, that the number of heads falls in the range
$(n/2)(1 \pm \epsilon)$ with probability almost one for n large. Pose the question,
how large a fluctuation about n/2 is nonsurprising? For instance, if you
get 60 H in 100 tosses, will you strongly suspect the coin of being biased?
If you get 54? 43? and so on. Look at the graph in Fig. 1.4. What we want is
a function $\varphi(n)$ increasing with n such that

(1.13)    $P(|S_n - n/2| < \varphi(n)) \to \alpha$, $0 < \alpha < 1$.

There are useful things we know about $\varphi(n)$. As the maximum height of the
graph is order $1/\sqrt{n}$ we suspect that we will have to go about $\sqrt{n}$ steps on
either side. Certainly if we put $\varphi(n) = x_n \sqrt{n}/2$ (the 1/2 factor to make things
work out later on), then $\liminf x_n > 0$, otherwise the limit in (1.13) would be
zero. By Chebyshev’s inequality,

    $P(|S_n - n/2| \ge x_n \sqrt{n}/2) \le n/(n x_n^2) = 1/x_n^2$.

So $\limsup x_n < \infty$, otherwise $\alpha$ in (1.13) would have to be one. These two
bounds lead to the immediate suspicion that we could take $x_n \to x$, $0 < x < \infty$.
But there is no reason, then, not to try $x_n = x$, all n. First, examine
the case for n even. We want to evaluate

    $P_n(x) = P(|S_{2n} - n| < x\sqrt{2n}/2)$.

This is given by

    $P_n(x) = \sum_{|k - n| < x\sqrt{n/2}} {}_{2n}C_k\, 2^{-2n}$.

Put k = n + j, to get

    $P_n(x) = \sum_{|j| < x\sqrt{n/2}} 2^{-2n}\, {}_{2n}C_{n+j}$,

    ${}_{2n}C_{n+j} = \dfrac{(2n)!}{(n+j)!\,(n-j)!}$,

    $(n+j)! = (n+j) \cdots (n+1)\, n!$,

    $(n-j)! = n!/[n(n-1) \cdots (n-j+1)]$.
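Put

    $P_n = P(S_{2n} = n) = 2^{-2n}(2n)!/n!\,n!$

and write

    $2^{-2n}\, {}_{2n}C_{n+j} = P_n \cdot \dfrac{n(n-1) \cdots (n-j+1)}{(n+j) \cdots (n+1)}$.

Let $D_{j,n}$ be the second factor above,

    $D_{j,n} = \dfrac{1}{(1 + j/n)(1 + j/(n-1)) \cdots (1 + j/(n-j+1))}$

and

    $\log D_{j,n} = -\sum_{k=0}^{j-1} \log(1 + j/(n-k))$.

Use the expansion $\log(1 + x) = x(1 + \epsilon(x))$, where $\lim_{x \to 0} \epsilon(x) = 0$:

    $\log D_{j,n} = -\sum_{k=0}^{j-1} \dfrac{j}{n-k}\left(1 + \epsilon\left(\dfrac{j}{n-k}\right)\right)$.

Note that j is restricted to the range $R_n = \{j; |j| < x\sqrt{n/2}\}$, so that if we
write

    $\log D_{j,n} = -(1 + \epsilon_{j,n}) \sum_{k=0}^{j-1} \dfrac{j}{n-k}$,

then $\sup_{j \in R_n} |\epsilon_{j,n}| \to 0$. Writing

    $\dfrac{j}{n-k} = \dfrac{j}{n} \cdot \dfrac{1}{1 - k/n}$,

since $0 \le k < j$, we find that

    $\log D_{j,n} = -(1 + \epsilon'_{j,n}) \sum_{k=0}^{j-1} \dfrac{j}{n} = -(1 + \epsilon'_{j,n})\, \dfrac{j^2}{n}$,

where again $\sup_{j \in R_n} |\epsilon'_{j,n}| \to 0$. Also for $j \in R_n$, $j^2/n < x^2/2$, so that

    $D_{j,n} = e^{-j^2/n}(1 + \Delta_{j,n})$, $j \in R_n$,

where $\sup_{j \in R_n} |\Delta_{j,n}| \to 0$. By (1.3),

    $P_n(x) = (1 + \delta_n)\, \dfrac{1}{\sqrt{\pi n}} \sum_{j \in R_n} e^{-j^2/n}(1 + \Delta_{j,n})$, $\lim_n \delta_n = 0$.

Make the changes of variable, $t_j = j\sqrt{2/n}$, $\Delta t = t_{j+1} - t_j = \sqrt{2/n}$, so the
condition $j \in R_n$ becomes

    $-x < t_j < x$.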
Now the end is near:

    $P_n(x) = (1 + \delta_n)\, \dfrac{1}{\sqrt{2\pi}} \sum_{-x < t_j < x} e^{-t_j^2/2}(1 + \Delta_{j,n})\, \Delta t$.

The factor on the right is graciously just the approximating sum for an
integral, that is, we have now shown that

(1.14)    $\lim_n P_n(x) = \dfrac{1}{\sqrt{2\pi}} \int_{-x}^{x} e^{-t^2/2}\, dt$.


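To get the odd values of n take h > 0 and note that for n sufficiently large

    $\{\omega; |S_{2n} - n| < (x - h)\sqrt{2n}/2\} \subset \left\{\omega; \left|S_{2n+1} - \dfrac{2n+1}{2}\right| < x\,\dfrac{\sqrt{2n+1}}{2}\right\}$

    $\subset \{\omega; |S_{2n} - n| < (x + h)\sqrt{2n}/2\}$,

yielding

    $\lim_n P_n(x - h) \le \liminf_n P\left(\left|S_{2n+1} - \dfrac{2n+1}{2}\right| < x\,\dfrac{\sqrt{2n+1}}{2}\right)$

    $\le \limsup_n P\left(\left|S_{2n+1} - \dfrac{2n+1}{2}\right| < x\,\dfrac{\sqrt{2n+1}}{2}\right) \le \lim_n P_n(x + h)$.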


Thus we have proved, as done originally (more or less), a special case of the 
famous central limit theorem, which along with the law of large numbers 
shares the throne in probability theory. 


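Theorem 1.15

    $\lim_n P\left(|S_n - n/2| < x\,\dfrac{\sqrt{n}}{2}\right) = \dfrac{1}{\sqrt{2\pi}} \int_{-x}^{+x} e^{-t^2/2}\, dt$.

There is a more standard form for this theorem: Let

(1.16)    $\Phi(x) = \dfrac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt$

and $Z_n = 2S_n - n$, that is, $Z_n$ is the excess of heads over tails in n tosses,
or if

    $Y_j = 1$ if $\omega_j = H$,
    $Y_j = -1$ if $\omega_j = T$,

then $Z_n = \sum_{j=1}^{n} Y_j$. From 1.15

    $\lim_n P\left(\left|\dfrac{Z_n}{\sqrt{n}}\right| < x\right) = \Phi(x) - \Phi(-x)$.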
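By symmetry,

    $P\left(0 \le \dfrac{Z_n}{\sqrt{n}} < x\right) = \tfrac12 P\left(\left|\dfrac{Z_n}{\sqrt{n}}\right| < x\right) + \tfrac12 P(Z_n = 0)$,

giving

    $\lim_n P\left(\dfrac{Z_n}{\sqrt{n}} < x\right) = \dfrac{\Phi(x) + 1 - \Phi(-x)}{2}$.

But $\Phi(+\infty) = 1$, so $1 - \Phi(-x) = \Phi(x)$, and therefore,

Theorem 1.17

    $\lim_n P\left(\dfrac{Z_n}{\sqrt{n}} < x\right) = \Phi(x)$.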

Thus, the asymptotic distribution of the deviation of the number of
heads from n/2 is governed by $\Phi(x)$. That $\Phi(x)$, the normal curve, should be
singled out from among all other limiting distributions is one of the most
magical and puzzling results in probability. Why $\Phi(x)$? The above proof
gives very little insight as to what properties of $\Phi(x)$ cause its sudden
appearance against the simple backdrop of fair coin-tossing. We return to this
later.
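
The convergence in Theorem 1.15 can be observed numerically. The sketch below is an illustration only; the function names `Phi` and `prob_within` are ours. It evaluates $\Phi$ through the error function, using the standard identity $\Phi(x) = \tfrac12(1 + \mathrm{erf}(x/\sqrt{2}))$, and compares the exact binomial probability on the left of 1.15 with $\Phi(x) - \Phi(-x)$.

```python
from math import comb, erf, sqrt


def Phi(x):
    """Standard normal distribution function (1.16), via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def prob_within(n, x):
    """Exact P(|S_n - n/2| < x*sqrt(n)/2) for n tosses of a fair coin."""
    bound = x * sqrt(n) / 2.0
    count = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) < bound)
    return count / 2**n


if __name__ == "__main__":
    x = 1.5
    limit = Phi(x) - Phi(-x)
    for n in (25, 100, 400, 2500):
        print(n, prob_within(n, x), limit)
```

Because $S_n$ takes integer values, the exact probabilities approach the limit with small jumps rather than monotonically.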


Problems 


8. Using Theorem 1.15 as an approximation device, find the smallest
integer N such that in 1600 tosses of a fair coin, there is probability at least
0.99 that the number of heads will fall in the range 800 ± N.


9. Show that for j even, 


P(Z,, = n)» 
-= 
where e„, —> 0 as n — 0, 
10. Show that

(1.18)    $\sup_j \left|\sqrt{\pi n}\, P(Z_{2n} = j) - e^{-j^2/4n}\right| \to 0$,

where the sup is over all even j.


11. Consider a sequence of experiments such that on the nth experiment a
coin is tossed independently n times with probability of heads $p_n$. If
$\lim_n n p_n = \lambda$, $0 < \lambda < \infty$, then letting $S_n$ be the number of heads occurring
in the nth experiment, show that

    $\lim_n P(S_n = j) = \dfrac{\lambda^j}{j!}\, e^{-\lambda}$, $j = 0, 1, 2, \ldots$
4. STRONG FORM OF THE “LAW OF AVERAGES”


Definition 1.19. Let $\Omega$ be the space consisting of all infinite sequences of H’s
and T’s. Denote a point in $\Omega$ by $\omega$. Let $\omega_j$ be the jth letter in $\omega$, that is, $\omega =
(\omega_1, \omega_2, \ldots)$. Define functions on $\Omega$ as follows:

    $X_j(\omega) = 1$ if $\omega_j = H$,
    $X_j(\omega) = 0$ if $\omega_j = T$,

    $S_n(\omega) = \sum_{j=1}^{n} X_j(\omega)$.


We are concerned with the behavior of $S_n(\omega)/n$ for large n. The intuitive
notion of the law of averages would be that

(1.20)    $\lim_n \dfrac{S_n(\omega)}{n} = \dfrac{1}{2}$

for all $\omega \in \Omega$. This is obviously false; the limit need not exist and even if it
does, certainly does not need to equal 1/2. [Consider $\omega = (H, H, H, \ldots)$.]
The most we can ask is whether for almost all sequences in $\Omega$, (1.20) holds.

What about the set E of $\omega$ such that (1.20) does not hold? We would
like to say that in some sense the probability of this exceptional set is small.
Here we encounter an essential difficulty. We know how to assign
probabilities to sets of the form

    $\{\omega; \omega_1 = H, \ldots, \omega_n = T\}$.

Indeed, if $A \subset \Omega_n$, we know how to assign probabilities to all sets of the
form

    $\{\omega; (\omega_1, \ldots, \omega_n) \in A\}$,

that is, simply as before

    $P(\omega; (\omega_1, \ldots, \omega_n) \in A) = \dfrac{1}{2^n}$ (number of sequences in A).

But the exceptional set E is the set such that $S_n(\omega)/n \not\to 1/2$ and so does not
depend on $\omega_1, \ldots, \omega_n$ for any n, but rather on the asymptotic distribution
of heads and tails in the sequence $\omega$.

But anyhow, let’s try to push through a proof and then see what is 
wrong with it and what needs to be fixed up. 


Theorem 1.21 (Strong law of large numbers). The probability of the set of
sequences E such that $S_n(\omega)/n \not\to 1/2$ is zero.


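Proof. First note that

    $\dfrac{S_n}{n} \to \dfrac12$ if and only if $\dfrac{S_{m^2}}{m^2} \to \dfrac12$,

since for any n, take m such that

    $m^2 \le n < (m+1)^2$ or $0 \le n - m^2 \le 2m$.

For this m,

    $\left|\dfrac{S_n}{n} - \dfrac{S_{m^2}}{m^2}\right| = \left|\dfrac{S_n}{n} - \dfrac{S_{m^2}}{n} + S_{m^2}\left(\dfrac{1}{n} - \dfrac{1}{m^2}\right)\right|$

    $\le \dfrac{n - m^2}{n} + m^2\left(\dfrac{1}{m^2} - \dfrac{1}{n}\right) \le \dfrac{2}{m} + \dfrac{2}{m} = \dfrac{4}{m}$.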


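Fix $\epsilon > 0$ and let $E_\epsilon$ be the set $\{\omega; \limsup_m |S_{m^2}/m^2 - 1/2| > \epsilon\}$. Look at the
set $E_{m_0,m_1} \subset \Omega$ of sequences such that the inequality $|S_{m^2}/m^2 - 1/2| > \epsilon$
occurs at least once for $m_0 \le m \le m_1$. That is,

    $E_{m_0,m_1} = \bigcup_{m=m_0}^{m_1} \left\{\omega; \left|\dfrac{S_{m^2}}{m^2} - \dfrac12\right| > \epsilon\right\}$.

The set $E_{m_0,m_1}$ is a set of sequences that depends only on the coordinates
$\omega_1, \ldots, \omega_{m_1^2}$. We know how to assign probability to such sets, and applying
the result of Problem 1,

    $P(E_{m_0,m_1}) \le \sum_{m=m_0}^{m_1} P\left(\omega; \left|\dfrac{S_{m^2}}{m^2} - \dfrac12\right| > \epsilon\right)$.

Using Chebyshev’s inequality in the form (1.10) we get

    $P(E_{m_0,m_1}) \le \dfrac{1}{4\epsilon^2} \sum_{m=m_0}^{m_1} \dfrac{1}{m^2}$.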


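Let $m_1$ go to infinity and note that

    $E_{m_0} = \lim_{m_1} E_{m_0,m_1} = \bigcup_{m=m_0}^{\infty} \left\{\omega; \left|\dfrac{S_{m^2}}{m^2} - \dfrac12\right| > \epsilon\right\}$

is the set of all sequences such that the inequality $|S_{m^2}/m^2 - 1/2| > \epsilon$ occurs
at least once for $m \ge m_0$. Also note that the $\{E_{m_0,m_1}\}$ are an increasing
sequence of sets in $m_1$ for $m_0$ fixed. If we could make a vital transition and
say that

    $P(E_{m_0}) = \lim_{m_1} P(E_{m_0,m_1})$,

then it would follow that

    $P(E_{m_0}) \le \dfrac{1}{4\epsilon^2} \sum_{m=m_0}^{\infty} \dfrac{1}{m^2}$.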
Now $\limsup_m |S_{m^2}/m^2 - 1/2| > \epsilon$ if and only if for any $m_0$, there is an $m \ge m_0$ such
that $|S_{m^2}/m^2 - 1/2| > \epsilon$. From this, $E_\epsilon = \lim_{m_0} E_{m_0}$, where the sets $E_{m_0}$ are
decreasing in $m_0$. (The limits of increasing or decreasing sequences of sets
are well defined, for example, $\lim_{m_0} E_{m_0} = \bigcap_{m_0 \ge 1} E_{m_0}$ and so forth.) If we
could again assert as above that

    $P(E_\epsilon) = \lim_{m_0} P(E_{m_0})$,

then

    $P(E_\epsilon) \le \lim_{m_0} \dfrac{1}{4\epsilon^2} \sum_{m=m_0}^{\infty} \dfrac{1}{m^2} = 0$.

By definition, E is the set $\{\omega; \limsup_m |S_{m^2}/m^2 - 1/2| > 0\}$, so $E = \lim_k E_{1/k}$,
k running through the positive integers, and the sets $E_{1/k}$ increasing in k.

Once more, if we assert that

    $P(E) = \lim_k P(E_{1/k})$,

then since $P(E_{1/k}) = 0$, all k > 0, consequently P(E) = 0 and the theorem
is proven. Q.E.D.???

The real question is one of how probability may be assigned to subsets
of $\Omega$. What we need for the above proof is an assignment of probability
$P(\cdot)$ on a class of subsets $\mathcal{F}$ of $\Omega$ such that $\mathcal{F}$ contains all the sets that appear
in the above proof and such that $P(\cdot)$ in some way corresponds to a fair
coin-tossing probability. More concretely, what we want are the statements

(1.22)

i) $\mathcal{F}$ contains all subsets depending only on a finite number of tosses, that is,
all sets of the form $\{\omega; (\omega_1, \ldots, \omega_n) \in A\}$, $A \subset \Omega_n$, $n \ge 1$, and $P(\cdot)$
is defined on these sets by

    $P(\omega; (\omega_1, \ldots, \omega_n) \in A) = P_n(A)$,

where $P_n$ is the probability defined previously on $\Omega_n$;
ii) if $A_n$ is any monotone sequence of sets in $\mathcal{F}$, then $\lim_n A_n$ is also in $\mathcal{F}$;
iii) if the $A_n$ are as in (ii) above, then

    $\lim_n P(A_n) = P(\lim_n A_n)$;

iv) if $A, B \in \mathcal{F}$ are disjoint, then

    $P(A \cup B) = P(A) + P(B)$.


Of these four, one is simply the requirement that the assignment be
consistent with our previous definition of independent coin-tossing. Two and
three are exactly the statement of what is needed to make the transitions in
the proof of the law of large numbers valid. Four is that the assignment
$P(\cdot)$ continue to have on $\Omega$ the property that the probability assignment has
on $\Omega_n$ and whose absence would seem intuitively most offensive, namely,
that if two sets of outcomes are disjoint, then the probability of getting into
either one or the other is the sum of the probabilities of each one.

Also, is the assignment of $P(\cdot)$ unique in any sense? If it is not, then we
are in real difficulty. We can put the above questions into more amenable
form. Let $\mathcal{F}_0$ be the class of all subsets of $\Omega$ depending on only a finite
number of tosses, then


Proposition 1.23. $\mathcal{F}_0$ is a field, where


Definition 1.24. A class of subsets $\mathcal{C}$ of a space $\Omega$ is a field if it is closed under
finite unions, intersections, and complementation. The complement of $\Omega$ is the
empty set $\emptyset$.


The proof of (1.23) is a direct verification. For economy, take $\mathcal{F}$ to be the
smallest class of sets containing $\mathcal{F}_0$ such that $\mathcal{F}$ has property (1.22ii). That
such a smallest class exists can be established by considering the sets common
to every class of sets containing $\mathcal{F}_0$ satisfying (1.22ii). But (see Appendix A),
these properties imply


Proposition 1.25. $\mathcal{F}$ is the smallest $\sigma$-field containing $\mathcal{F}_0$, where


Definition 1.26. A class of subsets $\mathcal{F}$ of $\Omega$ is a $\sigma$-field if it is closed under
complementation, and countable intersections and unions. For any class $\mathcal{C}$ of
subsets of $\Omega$, denote by $\mathcal{F}(\mathcal{C})$ the smallest $\sigma$-field containing $\mathcal{C}$.


Also


Proposition 1.27. $P(\cdot)$ on $\mathcal{F}$ satisfies (1.22) iff $P(\cdot)$ is a probability measure,
where


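Definition 1.28. A nonnegative set function $P(\cdot)$ defined on a $\sigma$-field $\mathcal{F}$ of
subsets of $\Omega$ is a probability measure if


i) (normalization) $P(\Omega) = 1$;
ii) ($\sigma$-additivity) for every finite or countable collection $\{B_k\}$ of sets in $\mathcal{F}$ such
that $B_k$ is disjoint from $B_j$, $k \ne j$,

    $P\left(\bigcup_k B_k\right) = \sum_k P(B_k)$.

Proof of 1.27. If $P(\cdot)$ satisfies (1.22), then by finite induction on (iv)

    $P\left(\bigcup_{k=1}^{n} B_k\right) = \sum_{k=1}^{n} P(B_k)$.

Let $A_n = \bigcup_{k=1}^{n} B_k$, then the $A_n$ are a monotone sequence of sets, $\lim A_n =
\bigcup_{k=1}^{\infty} B_k$. By (1.22iii),

    $P\left(\bigcup_{k=1}^{\infty} B_k\right) = \lim_n \left(\sum_{k=1}^{n} P(B_k)\right) = \sum_{k=1}^{\infty} P(B_k)$.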
Conversely, if $P(\cdot)$ is $\sigma$-additive, it implies (1.22). For if the $\{A_n\}$ are a
monotone sequence, say $A_n \subset A_{n+1}$, we can let $B_k = A_k - A_{k-1}$, k > 1,
$B_1 = A_1$. The $\{B_k\}$ are disjoint, and $\lim_n A_n = \bigcup_k B_k$. Thus $\sigma$-additivity gives

    $P(\lim_n A_n) = \sum_k P(B_k)$.

Use $\bigcup_{k=1}^{n} B_k = A_n$, $P(A_n) = \sum_{k=1}^{n} P(B_k)$ to get

    $\lim_n P(A_n) = \lim_n \left(\sum_{k=1}^{n} P(B_k)\right) = \sum_k P(B_k)$.
The starting point is a set function with the following properties: 


Definition 1.29. A nonnegative set function P on a field $\mathcal{F}_0$ is a finite
probability measure if

i) $P(\Omega) = 1$;

ii) for $A, B \in \mathcal{F}_0$, and disjoint,

    $P(A \cup B) = P(A) + P(B)$.


Now the original question can be restated in more standard form:
Given the finite probability measure $P(\cdot)$ defined on $\mathcal{F}_0$ by (1.22i), does there
exist a probability measure defined on $\mathcal{F}$ and agreeing with $P(\cdot)$ on $\mathcal{F}_0$?
And in what sense is the measure unique? The problem is seen to be one of
extension: given $P(\cdot)$ on $\mathcal{F}_0$, is it possible to extend the domain of definition
of $P(\cdot)$ to $\mathcal{F}$ such that it is $\sigma$-additive? But this is a standard measure
theoretical question. The surprise is that the attempt to patch up the strong law
of large numbers has led directly to this well-known problem (see Appendix
A.9).


5. AN ANALYTIC MODEL FOR COIN-TOSSING 


The fact that the sequence $X_1(\omega), X_2(\omega), \ldots$ comprised functions depending
on consecutive independent tosses of a fair coin was to some extent immaterial
in the proof of the strong law. For example, produce functions on a different
space $\Omega'$ this way: Toss a well-balanced, six-sided die independently, let
$\Omega'$ be the space of all infinite sequences $\omega' = (\omega'_1, \omega'_2, \ldots)$, where $\omega'_j$ takes
values in (1, ..., 6). Define $X'_n(\omega')$ to be one if the nth throw results in an
even face, zero if in an odd face. The sequence $X'_1(\omega'), X'_2(\omega'), \ldots$ has the
same probabilistic structure as $X_1(\omega), \ldots$ in the sense that the probability
of any sequence n-long of zeros and ones is $1/2^n$ in both models (with the
appropriate definition of independent throws of a well-balanced die). But
this assignment of probabilities is the important information, rather than the
exact nature of the underlying space. For example, the same argument
leading to the strong law of large numbers holds for the variables $X'_1, X'_2, \ldots$.
Therefore, in general, we will consider as a model for fair coin-tossing any
set of functions $X_1, X_2, \ldots$, with values zero or one defined on a space $\Omega$
of points $\omega$ such that probability $1/2^n$ is assigned to all sets of the form

    $\{\omega; X_1 = s_1, \ldots, X_n = s_n\}$

for $s_1, \ldots, s_n$ any sequence of zeros and ones.

An interesting analytic model can be constructed on the half-open
unit interval $\Omega = [0, 1)$. It can be shown that every number x in [0, 1)
has a unique binary expansion containing an infinite number of zeros. The
latter restriction takes care of binary rational points which have two
expansions, that is

    $\dfrac12 = 0.1000\cdots = \dfrac12 + \dfrac{0}{2^2} + \dfrac{0}{2^3} + \cdots$,

    $\dfrac12 = 0.0111\cdots = \dfrac{0}{2} + \dfrac{1}{2^2} + \dfrac{1}{2^3} + \cdots$.

Now for any $x \in [0, 1)$ write down this expansion $x = .x_1x_2\cdots$ and define

(1.30)    $X_n(x) = x_n$.

That is, $X_n(x)$ is the nth digit in the expansion of x (see Fig. 1.5).

[Figure 1.5: $X_1(x)$ and $X_2(x)$.]

To every interval $I \subset [0, 1)$, assign probability $P(I) = \|I\|$, the length
of I. Now check that the probability of the set

    $\{x; X_1 = s_1, \ldots, X_n = s_n\}$

is $1/2^n$, because this set is exactly the interval

    $\left[\dfrac{s_1}{2} + \dfrac{s_2}{2^2} + \cdots + \dfrac{s_n}{2^n},\ \dfrac{s_1}{2} + \dfrac{s_2}{2^2} + \cdots + \dfrac{s_n}{2^n} + \dfrac{1}{2^n}\right)$.

Thus, $X_1(x), X_2(x), \ldots$ on [0, 1) with the given probability is a model for
fair coin-tossing.
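
The correspondence between digit patterns and dyadic intervals is easy to compute. The sketch below is an illustration with names of our own choosing. It extracts the first n binary digits of x by repeated doubling, which for a binary rational yields the expansion ending in zeros, and returns the interval attached to a given pattern $s_1, \ldots, s_n$. Exact rationals (`Fraction`) are used as an assumption of the sketch, to avoid floating-point rounding.

```python
from fractions import Fraction


def binary_digits(x, n):
    """First n binary digits x_1, ..., x_n of x in [0, 1), by repeated doubling."""
    digits = []
    for _ in range(n):
        x = x * 2
        d = int(x)          # integer part is the next digit, 0 or 1
        digits.append(d)
        x = x - d           # keep the fractional part
    return digits


def dyadic_interval(s):
    """Endpoints (left, right) of the interval {x; X_1 = s_1, ..., X_n = s_n}."""
    left = sum(Fraction(d, 2 ** (i + 1)) for i, d in enumerate(s))
    return left, left + Fraction(1, 2 ** len(s))


if __name__ == "__main__":
    x = Fraction(5, 8)
    s = binary_digits(x, 3)
    print(s, dyadic_interval(s))
```

The interval returned always has length $1/2^n$, which is the probability assigned to the digit pattern.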
The interest in this particular model is that the extension of P is a classical
result. The smallest $\sigma$-field containing all the intervals

    $\{x; X_1(x) = s_1, \ldots, X_n(x) = s_n\}$

is the Borel field $\mathcal{B}_1([0, 1))$ of subsets of [0, 1), and there is a unique extension
of P to a probability on this field, namely, Lebesgue measure. The theorem
establishing the existence of Lebesgue measure makes the proof of the
strong law of large numbers rigorous for the analytic model. The statement
in this context is:


Theorem 1.31. For almost all $x \in [0, 1)$ with respect to Lebesgue measure, the
asymptotic proportion of zeros and of ones in the binary expansion of x is 1/2.
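
A simulation suggests what Theorem 1.31 says, though of course it proves nothing. The sketch below rests on an assumption that should be stated plainly: picking x at random according to Lebesgue measure is replaced by generating its binary digits one at a time as pseudo-random fair bits, which is the coin-tossing model itself. The function name and the seed are ours.

```python
import random


def running_proportion_of_ones(n_digits, seed=0):
    """Generate n_digits pseudo-random binary digits and record the proportion
    of ones at n = 10, 100, 1000, ... (a stand-in for the digits of a random x)."""
    rng = random.Random(seed)
    ones = 0
    checkpoints = {}
    next_checkpoint = 10
    for n in range(1, n_digits + 1):
        ones += rng.getrandbits(1)
        if n == next_checkpoint:
            checkpoints[n] = ones / n
            next_checkpoint *= 10
    return checkpoints


if __name__ == "__main__":
    for n, proportion in running_proportion_of_ones(100_000).items():
        print(n, proportion)
```

The recorded proportions drift toward 1/2, with fluctuations of the order $1/\sqrt{n}$ described in Section 3.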


The existence of Lebesgue measure in the analytic model for coin- 
tossing makes it plausible that there is an extended probability in the original 
model. This is true, but we defer the proof to the next chapter. 

Another way of looking at the analytic model is to say that the binary
expansion of a number in [0, 1) produces independent zeros and ones with
probability 1/2 each with respect to Lebesgue measure. Thus, as Theorem
1.31 illustrates, any results established for fair coin-tossing can be written as
theorems concerning functions and numbers on and in [0, 1) with respect
to Lebesgue measure. Denote Lebesgue measure from now on by dx or
(dx).

Problem 12. (The Rademacher Functions). Let

    $y_i(x) = -1$ if $x_i = 1$,
    $y_i(x) = +1$ if $x_i = 0$,

where $x = .x_1x_2\cdots$ is the unique binary expansion of x containing an
infinite number of zeros. Show from the properties of coin-tossing that if
$i_j \ne i_k$ for $j \ne k$,

    $\int_0^1 y_{i_1}(x) \cdots y_{i_n}(x)\, dx = 0$.

Graph $y_3(x)$, $y_4(x)$. Show that the sequence of functions $\{y_i(x)\}$ is
orthonormal with respect to Lebesgue measure.


6. CONCLUSIONS 


The strong law of large numbers and the central limit theorem illustrate the 
two main types of limit theorems in probability. 


Strong limit theorems. Given a sequence of functions $Y_1(\omega), Y_2(\omega), \ldots$ there is
a limit function $Y(\omega)$ such that $P(\omega; \lim_n Y_n(\omega) = Y(\omega)) = 1$.

Weak limit theorems. Given a sequence of functions $Y_1(\omega), Y_2(\omega), \ldots$ show
that

    $\lim_n P(\omega; Y_n(\omega) < x)$

exists for every x.

There is a great difference between strong and weak theorems which will
become more apparent. We will show later, for instance, that $Z_n/\sqrt{n}$ has
no limit in any reasonable way. A more dramatic example of this is: on
$([0, 1), \mathcal{B}_1([0, 1)))$ with $P$ being Lebesgue measure, define

$$Y_n(y) = \begin{cases} 0, & y < \tfrac{1}{2}, \\ 1, & \tfrac{1}{2} \le y < 1, \end{cases}$$

for $n$ even. For $n$ odd,

$$Y_n(y) = \begin{cases} 1, & y < \tfrac{1}{2}, \\ 0, & \tfrac{1}{2} \le y < 1. \end{cases}$$

For all $n$, $P(y; Y_n(y) < x) = P(y; Y_1(y) < x)$. But for every $y \in [0, 1)$

$$\limsup_n Y_n(y) = 1, \qquad \liminf_n Y_n(y) = 0.$$

To begin with we concentrate on strong limit theorems. But to do this we
need a more firmly constructed measure theoretic foundation.
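
The alternating example can be made concrete with a few lines of code. This is a minimal sketch under the definitions above; the helper names `Y` and `distribution_function` are hypothetical, and the measure of each half of $[0, 1)$ is entered by hand as $\tfrac{1}{2}$.

```python
from fractions import Fraction

def Y(n, y):
    """Y_n(y): the two halves of [0, 1) swap roles with the parity of n."""
    left_half = y < Fraction(1, 2)
    if n % 2 == 0:
        return 0 if left_half else 1
    return 1 if left_half else 0

def distribution_function(n, x):
    """P(y; Y_n(y) < x), using one representative point from each half of [0, 1)."""
    mass = Fraction(0)
    for representative in (Fraction(1, 4), Fraction(3, 4)):
        if Y(n, representative) < x:
            mass += Fraction(1, 2)   # each half has Lebesgue measure 1/2
    return mass

# Same distribution for every n ...
print([distribution_function(n, Fraction(1, 2)) for n in range(1, 5)])
# ... yet the values at a fixed point oscillate and have no limit.
print([Y(n, Fraction(1, 4)) for n in range(1, 7)])
```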


NOTES 


To get some of the fascinating interplay between probability and number 
theory, refer to Mark Kac’s monograph [83]. 

Although there will be very little subsequent work with combinatorics 
in this text, they occupy an honored and powerful place in probability theory. 
First, for many of the more important theorems, the original version was 
for independent fair coin-tossing. Even outside of this, there are some strong 
theorems in probability for which the most interesting proofs are combinatorial.
Good sources for these uses are Feller's books [59].

An elegant approach to the measure theoretic aspects of probability 
can be found in Neveu’s book [113]. 


CHAPTER 2 


MATHEMATICAL FRAMEWORK 


1. INTRODUCTION


The context that is necessary for the strong limit theorems we want to prove is:

Definition 2.1. A probability space consists of a triple $(\Omega, \mathcal{F}, P)$ where

i) $\Omega$ is a space of points $\omega$, called the sample space and sample points.

ii) $\mathcal{F}$ is a $\sigma$-field of subsets of $\Omega$. These subsets are called events.

iii) $P(\cdot)$ is a probability measure on $\mathcal{F}$; henceforth refer to $P$ as simply a
probability.

On $\Omega$ there is defined a sequence of real-valued functions $X_1(\omega), X_2(\omega), \ldots$
which are random variables in the sense of


Definition 2.2. A function $X(\omega)$ defined on $\Omega$ is called a random variable if for
every Borel set $B$ in the real line $R^{(1)}$, the set $\{\omega; X(\omega) \in B\}$ is in $\mathcal{F}$. ($X(\omega)$ is a
measurable function on $(\Omega, \mathcal{F})$.)


Whether a given function is a random variable, of course, depends on the
pair $(\Omega, \mathcal{F})$. The reason underlying 2.2 is that we want probability assigned to
all sets of the form $\{\omega; X(\omega) \in I\}$, where $I$ is some interval. It will follow
from 2.29 that if $\{\omega; X(\omega) \in I\}$ is in $\mathcal{F}$ for all intervals $I$, then $X$ must be a
random variable.


Definition 2.3. A countable stochastic process, or process, is a sequence of
random variables $X_1, X_2, \ldots$ defined on a common probability space $(\Omega, \mathcal{F}, P)$.


But in a probabilistic model arising in gambling or science the given
data are usually an assignment of probability to a much smaller class of
sets. For example, if all the variables $X_1, X_2, \ldots$ take values in some countable
set $F$, the probability of all sets of the form

(2.4) $\{\omega; X_1 = s_1, \ldots, X_n = s_n\}$, $s_k \in F$, $k = 1, \ldots, n$

is usually given. If the $X_1, X_2, \ldots$ are not discrete, then often the specification
is for all sets of the form

(2.5) $\{\omega; X_1 \in I_1, \ldots, X_n \in I_n\}$,

where $I_1, \ldots, I_n$ are intervals.

To justify the use of a probability space as a framework for probability
theory it is really necessary to show that a reasonable assignment of probabilities
to a small class of sets has a unique extension to a probability $P$ on
a probability space $(\Omega, \mathcal{F}, P)$. There are fairly general results to this effect.
We defer this until we have explored some of the measure-theoretic properties 
of processes. 


2. RANDOM VECTORS 


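Given two spaces $\Omega$ and $R$, let $X$ be a function on $\Omega$ to $R$, $X: \Omega \to R$.
The inverse image under $X$ of a set $B \subset R$ is $\{\omega; X(\omega) \in B\}$. We abbreviate this
by $\{X \in B\}$.


Proposition 2.6. Set operations are preserved under inverse mappings, that is,

$$\{X \in \bigcup_\alpha B_\alpha\} = \bigcup_\alpha \{X \in B_\alpha\},$$

$$\{X \in \bigcap_\alpha B_\alpha\} = \bigcap_\alpha \{X \in B_\alpha\},$$

$$\{X \in B^c\} = \{X \in B\}^c.$$

($B^c$ denotes the complement of the set $B$.)

Proof. By definition.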
This quickly gives 


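Proposition 2.7. If $X: \Omega \to R$, and $\mathcal{B}$ is a $\sigma$-field in $R$, the class of sets $\{X \in B\}$,
$B \in \mathcal{B}$, is a $\sigma$-field. If $\mathcal{F}$ is a $\sigma$-field in $\Omega$, then the class of subsets $B$ in $R$ such
that $\{X \in B\} \in \mathcal{F}$ is a $\sigma$-field.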


Proof. Both assertions are obvious from 2.6. 
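
Propositions 2.6 and 2.7 can be verified by brute force on a finite example. The sketch below assumes a hypothetical four-point sample space and a hypothetical map `X` into $\{0, 1, 2, 3\}$; it checks the three identities of 2.6 for one pair of sets and then confirms that the class of all inverse images is closed under complements and unions.

```python
from itertools import combinations

def inverse_image(X, B):
    """{X in B}: the set of sample points w with X(w) in B."""
    return frozenset(w for w in X if X[w] in B)

def subsets(points):
    """All subsets of a finite collection of points."""
    points = list(points)
    for r in range(len(points) + 1):
        for combo in combinations(points, r):
            yield frozenset(combo)

# Hypothetical example: X maps four sample points into R = {0, 1, 2, 3}.
X = {"a": 0, "b": 0, "c": 1, "d": 2}
omega = frozenset(X)
R = frozenset({0, 1, 2, 3})
B1, B2 = frozenset({0, 3}), frozenset({0, 1})

# Proposition 2.6: unions, intersections, complements pass through {X in .}.
union_ok = inverse_image(X, B1 | B2) == inverse_image(X, B1) | inverse_image(X, B2)
intersection_ok = inverse_image(X, B1 & B2) == inverse_image(X, B1) & inverse_image(X, B2)
complement_ok = inverse_image(X, R - B1) == omega - inverse_image(X, B1)

# Proposition 2.7 (first assertion): the inverse images form a (sigma-)field.
field_of_X = {inverse_image(X, B) for B in subsets(R)}
closed = all(
    (omega - A) in field_of_X and (A | A2) in field_of_X
    for A in field_of_X
    for A2 in field_of_X
)
print(union_ok, intersection_ok, complement_ok, closed, len(field_of_X))
```

On a finite space a field and a $\sigma$-field coincide, which is why checking finite unions suffices here.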


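Definition 2.8. If there are $\sigma$-fields $\mathcal{F}$ and $\mathcal{B}$ in $\Omega$, $R$ respectively, $X: \Omega \to R$
is called a random vector if $\{X \in B\} \in \mathcal{F}$ for all $B \in \mathcal{B}$. ($X$ is a measurable map
from $(\Omega, \mathcal{F})$ to $(R, \mathcal{B})$.)


We will sometimes refer to $(R, \mathcal{B})$ as the range space of $X$. But the range of $X$
is the direct image under $X$ of $\Omega$, that is, the union of all points $X(\omega)$, $\omega \in \Omega$.
Denote by $\mathcal{F}(X)$ the $\sigma$-field of all sets of the form $\{X \in B\}$, $B \in \mathcal{B}$.


Definition 2.9. If $\mathcal{A}$ is a $\sigma$-field contained in $\mathcal{F}$, call $X$ $\mathcal{A}$-measurable if $\mathcal{F}(X) \subset \mathcal{A}$.


If there is a probability space $(\Omega, \mathcal{F}, P)$ and $X$ is a random vector with
range space $(R, \mathcal{B})$, then $\hat{P}$ can be naturally defined on $\mathcal{B}$ by

$$\hat{P}(B) = P(X \in B).$$

It is easy to check that $\hat{P}$ defined this way is a probability on $\mathcal{B}$.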


Definition 2.10. $\hat{P}$ is called the probability distribution of the random vector $X$.


Conversely, suppose $X$ is a random vector on $(\Omega, \mathcal{F})$ to $(R, \mathcal{B})$ and
there is a probability distribution $\hat{P}$ defined on $\mathcal{B}$. Since every set in $\mathcal{F}(X)$
is of the form $\{X \in B\}$, $B \in \mathcal{B}$, can $P$ be defined on $\mathcal{F}(X)$ by

(2.11) $P(X \in B) = \hat{P}(B)$?

The answer, in general, is no! The difficulty is that the same set $A \in \mathcal{F}(X)$
may be represented in two different ways as $\{X \in B_1\}$ and $\{X \in B_2\}$, and there
is no guarantee that $\hat{P}(B_1) = \hat{P}(B_2)$. What is true is


Proposition 2.12. Let $F$ be the range of $X$. If $B \in \mathcal{B}$, $B \subset F^c$ implies $\hat{P}(B) = 0$,
then $P$ is uniquely defined on $\mathcal{F}(X)$ by 2.11, and is a probability.


Proof. If $A = \{X \in B_1\} = \{X \in B_2\}$, then $B_1 - B_2$ and $B_2 - B_1$ are both
in $F^c$. Hence $\hat{P}(B_1) = \hat{P}(B_2)$. The $\sigma$-additivity is quickly verified.
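
A small finite example shows both the failure of (2.11) and the way the hypothesis of Proposition 2.12 repairs it. The sketch is illustrative only: the two-point sample space, the map `X`, and the two candidate distributions `bad` and `good` are all hypothetical choices. The point 2 lies outside the range of `X`, and `bad` puts positive mass there.

```python
from fractions import Fraction
from itertools import combinations

def subsets(points):
    """All subsets of a finite collection of points."""
    points = list(points)
    for r in range(len(points) + 1):
        for combo in combinations(points, r):
            yield frozenset(combo)

def inverse_image(X, B):
    """{X in B}: the set of sample points w with X(w) in B."""
    return frozenset(w for w in X if X[w] in B)

def well_defined(X, R, P_hat):
    """True if P_hat(B1) == P_hat(B2) whenever {X in B1} == {X in B2}."""
    value_on_event = {}
    for B in subsets(R):
        event = inverse_image(X, B)
        mass = sum((P_hat[r] for r in B), Fraction(0))
        if value_on_event.setdefault(event, mass) != mass:
            return False
    return True

X = {"a": 0, "b": 1}        # the range of X is {0, 1}
R = {0, 1, 2}               # the range space has one extra point
bad = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}
good = {0: Fraction(1, 3), 1: Fraction(2, 3), 2: Fraction(0)}

print(well_defined(X, R, bad))    # mass outside the range of X
print(well_defined(X, R, good))   # no mass outside the range of X
```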


Problem 1. Use 2.12 and the existence of the analytic model for coin-tossing 
to prove the existence of the desired extension of P in the original model. 


3. THE DISTRIBUTION OF PROCESSES 


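Denote by $R^{(\infty)}$ the space consisting of all infinite sequences $(x_1, x_2, \ldots)$
of real numbers. In $R^{(\infty)}$ an $n$-dimensional rectangle is a set of the form

$$\{\mathbf{x} \in R^{(\infty)}; x_1 \in I_1, \ldots, x_n \in I_n\},$$

where $I_1, \ldots, I_n$ are finite or infinite intervals. Take the Borel field $\mathcal{B}_\infty$
to be the smallest $\sigma$-field of subsets of $R^{(\infty)}$ containing all finite-dimensional
rectangles.

If each component of $\mathbf{X} = (X_1, \ldots)$ is a random variable, then it follows
that the vector $\mathbf{X}$ is a measurable mapping to $(R^{(\infty)}, \mathcal{B}_\infty)$. In other words,


Proposition 2.13. If $X_1, X_2, \ldots$ are random variables on $(\Omega, \mathcal{F})$, then for
$\mathbf{X} = (X_1, X_2, \ldots)$ and every $B \in \mathcal{B}_\infty$, $\{\mathbf{X} \in B\} \in \mathcal{F}$.


Proof. Let $S$ be a finite-dimensional rectangle

$$S = \{\mathbf{x} \in R^{(\infty)}; x_1 \in I_1, \ldots, x_n \in I_n\}.$$

Then

$$\{\mathbf{X} \in S\} = \{X_1 \in I_1\} \cap \{X_2 \in I_2\} \cap \cdots \cap \{X_n \in I_n\}.$$

This is certainly in $\mathcal{F}$. Now let $\mathcal{C}$ be the class of sets $C$ in $\mathcal{B}_\infty$ such that
$\{\mathbf{X} \in C\} \in \mathcal{F}$. By 2.7 $\mathcal{C}$ is a $\sigma$-field. Since $\mathcal{C}$ contains all rectangles, $\mathcal{C} = \mathcal{B}_\infty$.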


If all that we observe are the values of a process $X_1(\omega), X_2(\omega), \ldots$ the
underlying probability space is certainly not uniquely determined. As an
example, suppose that in one room a fair coin is being tossed independently,
and calls zero or one are being made for tails or heads respectively. In
another room a well-balanced die is being cast independently and zero or one
called as the resulting face is odd or even. There is, however, no way of
discriminating between these two experiments on the basis of the calls.
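
The two-room example can be checked by enumerating outcomes. The sketch below is an illustration under the stated setup; the function name `call_distribution` and the dictionary encodings of the coin and the die are hypothetical. It computes the exact distribution of the first three calls in each room and compares them.

```python
from fractions import Fraction
from itertools import product

def call_distribution(outcome_prob, call, n):
    """Distribution of the first n calls under n independent repetitions."""
    dist = {}
    for outcomes in product(outcome_prob, repeat=n):
        prob = Fraction(1)
        for outcome in outcomes:
            prob *= outcome_prob[outcome]
        calls = tuple(call[outcome] for outcome in outcomes)
        dist[calls] = dist.get(calls, Fraction(0)) + prob
    return dist

# Room 1: fair coin, zero for tails and one for heads.
coin = {"tails": Fraction(1, 2), "heads": Fraction(1, 2)}
coin_call = {"tails": 0, "heads": 1}

# Room 2: balanced die, zero for an odd face and one for an even face.
die = {face: Fraction(1, 6) for face in range(1, 7)}
die_call = {face: 0 if face % 2 == 1 else 1 for face in die}

same = call_distribution(coin, coin_call, 3) == call_distribution(die, die_call, 3)
print(same)
```

The underlying sample spaces differ (two points per trial against six), yet the distributions of the calls coincide.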

Denote $\mathbf{X} = (X_1, \ldots)$. From an observational point of view, the thing
that really interests us is not the space $(\Omega, \mathcal{F}, P)$, but the distribution of the
values of $\mathbf{X}$. If two processes, $\mathbf{X}$ on $(\Omega, \mathcal{F}, P)$, $\mathbf{X}'$ on $(\Omega', \mathcal{F}', P')$ have the same
probability distribution,

(2.14) $\hat{P}(B) = P(\mathbf{X} \in B) = P'(\mathbf{X}' \in B)$, all $B \in \mathcal{B}_\infty$,

then there is no way of distinguishing between the processes by observing
them.


Definition 2.15. Two processes $\{X_n\}$ on $(\Omega, \mathcal{F}, P)$ and $\{X_n'\}$ on $(\Omega', \mathcal{F}', P')$ will
be said to have the same distribution if (2.14) holds.


The distribution of a process contains all the information which is
relevant to probability theory. All theorems we will prove depend only on
the distribution of the process, and hence hold for all processes having that
distribution. Among all processes having a given distribution $\hat{P}$ on $\mathcal{B}_\infty$,
there is one which has some claim to being the simplest.


Definition 2.16. For any given distribution $\hat{P}$ define random variables $\hat{X}_1, \hat{X}_2, \ldots$
on $(R^{(\infty)}, \mathcal{B}_\infty, \hat{P})$ by

$$\hat{X}_n(x_1, x_2, \ldots) = x_n.$$

This process is called the coordinate representation process and has the same
distribution as the original process.

This last assertion is immediate since for any $B \in \mathcal{B}_\infty$,

$$\hat{P}(\mathbf{x}; (\hat{X}_1, \hat{X}_2, \ldots) \in B) = \hat{P}(\mathbf{x}; \mathbf{x} \in B) = \hat{P}(B).$$

This construction also leads to the observation that given any probability
$\hat{P}$ on $\mathcal{B}_\infty$, there exists a process $\mathbf{X}$ such that $P(\mathbf{X} \in B) = \hat{P}(B)$.

Define the Borel field $\mathcal{B}_n$ in $R^{(n)}$ as the smallest $\sigma$-field containing all
rectangles $\{(x_1, \ldots, x_n); x_1 \in I_1, \ldots, x_n \in I_n\}$, $I_1, \ldots, I_n$ intervals.


Definition 2.17. An $n$-dimensional cylinder set in $R^{(\infty)}$ is any set of the form

$$\{\mathbf{x} \in R^{(\infty)}; (x_1, \ldots, x_n) \in B\}, \quad B \in \mathcal{B}_n.$$


Problems

2. Show that the class of all finite-dimensional cylinder sets is a field, but not
a $\sigma$-field.

3. Let $F$ be a countable set, $F = \{r_j\}$. Denote by $F^{(\infty)}$ the set of all infinite
sequences with coordinates in $F$. Show that $X: \Omega \to F$ is a random variable
with respect to $(\Omega, \mathcal{F})$ iff $\{\omega; X(\omega) = r_j\} \in \mathcal{F}$, all $j$.

4. Given two processes $\mathbf{X}$, $\mathbf{X}'$ such that both take values in $F^{(\infty)}$, show that
they have the same distribution iff

$$P(X_1 = t_1, \ldots, X_n = t_n) = P'(X_1' = t_1, \ldots, X_n' = t_n)$$

for every $n$-sequence $(t_1, \ldots, t_n)$, $t_k \in \{r_j\}$, $k = 1, \ldots, n$.


4. EXTENSION IN SEQUENCE SPACE 


Given the concept of the distribution of a process, the extension problem can
be looked at in a different way. The given data is a specification of values
$P(\mathbf{X} \in B)$ for a class of sets in $\mathcal{B}_\infty$. That is, a set function $\hat{P}$ is defined for a
class of sets $\mathcal{C} \subset \mathcal{B}_\infty$, and $\hat{P}(B)$ is the probability that the observed values of
the process fall in $B \in \mathcal{C}$. Now ask: Does there exist any process whose
distribution agrees with $\hat{P}$ on $\mathcal{C}$? Alternatively, construct a process $\mathbf{X}$ such
that $P(\mathbf{X} \in B) = \hat{P}(B)$ for all $B \in \mathcal{C}$. This is equivalent to the question of
whether $\hat{P}$ on $\mathcal{C}$ can be extended to a probability on $\mathcal{B}_\infty$. Because if so, the
coordinate representation process has the desired distribution. As far as the
original sample space is concerned, once $\hat{P}$ on $\mathcal{B}_\infty$ is gotten, 2.12 can be used
to get an extension of $P$ to $\mathcal{F}(\mathbf{X})$, if $\hat{P}$ assigns probability zero to sets $B \in \mathcal{B}_\infty$
falling in the complement of the range of $\mathbf{X}$.

Besides this, another reason for looking at the extension problem on
$(R^{(\infty)}, \mathcal{B}_\infty)$ is that this is the smoothest space on which we can always put a
process having any given distribution. It has some topological properties
which allow nice extension results to be proved.

The basic extension theorem we use is the analog in $\mathcal{B}_\infty$ of the extension
of measures on the real line from their values on intervals. Let $\mathcal{C}$ be the class
of all finite-dimensional rectangles, and assume that $\hat{P}$ is defined on $\mathcal{C}$.
A finite-dimensional rectangle may be written as a disjoint union of
finite-dimensional rectangles, for instance,

$$\{\mathbf{x} \in R^{(\infty)}; x_1 \in [0, 1)\} = \{\mathbf{x} \in R^{(\infty)}; x_1 \in [0, \tfrac{1}{2})\} \cup \{\mathbf{x} \in R^{(\infty)}; x_1 \in [\tfrac{1}{2}, 1)\}.$$

Of course, we will insist that if a rectangle $S$ is a finite union $\bigcup_j S_j$ of disjoint
rectangles, then $\hat{P}(S) = \sum_j \hat{P}(S_j)$. But an additional regularity condition is
required, simply because the class of finite probabilities is much larger than
the class of probabilities.


Extension Theorem 2.18. Let $\hat{P}$ be defined on the class $\mathcal{C}$ of all finite-dimensional
rectangles and have the properties:

a) $\hat{P}(\cdot) \ge 0$, $\hat{P}(R^{(\infty)}) = 1$;

b) if $\{S_j\}$, $j = 1, \ldots, m$, are disjoint $n$-dimensional rectangles and

$$S = \bigcup_{j=1}^{m} S_j$$

is a rectangle, then

$$\hat{P}(S) = \sum_{j=1}^{m} \hat{P}(S_j);$$

c) if $\{S_j\}$ are a nondecreasing sequence of $n$-dimensional rectangles, and $S_j \uparrow S$,

$$\lim_j \hat{P}(S_j) = \hat{P}(S).$$

Then there is a unique extension of $\hat{P}$ to a probability on $\mathcal{B}_\infty$.


Proof. As this result belongs largely to the realm of measure theory rather 
than probability, we relegate its proof to Appendix A.48. 


Theorem 2.18 translates into probability language as: If probability
is assigned in a reasonable way to rectangles, then there exists a process
$X_1, X_2, \ldots$ such that $P(X_1 \in I_1, \ldots, X_n \in I_n)$ has the specified values.

If the probabilities are assigned to rectangles, then in order to be well
defined, the assignment must be consistent. This means here that since an
$n$-dimensional rectangle is also an $(n + 1)$-dimensional rectangle (take
$I_{n+1} = R^{(1)}$), its assignment as an $(n + 1)$-dimensional rectangle must agree
with its assignment as an $n$-dimensional rectangle.

Now consider the situation in which the probability distributions of all
finite collections of random variables in a process are specified. Specifically,
probabilities $\hat{P}_n$ on $\mathcal{B}_n$, $n = 1, 2, \ldots$, are given and $\hat{P}$ is defined on the class
of all finite-dimensional cylinder sets (2.17) by

$$\hat{P}(\mathbf{x} \in R^{(\infty)}; (x_1, \ldots, x_n) \in B) = \hat{P}_n(B), \quad B \in \mathcal{B}_n.$$

In order for $\hat{P}$ to be well-defined, the $\hat{P}_n$ must be consistent: every
$n$-dimensional cylinder set is also an $(n + 1)$-dimensional cylinder set and must
be given the same probability by $\hat{P}_n$ and $\hat{P}_{n+1}$.


Corollary 2.19. (Kolmogorov extension theorem). There is a unique extension
of $\hat{P}$ to a probability on $\mathcal{B}_\infty$.


Proof. $\hat{P}$ is defined on the class $\mathcal{C}$ of all finite-dimensional rectangles and is
certainly a finite probability on $\mathcal{C}$. Let $S_j^*$, $S^*$ be rectangles in $R^{(n)}$, $S_j^* \uparrow S^*$,
then

$$\hat{P}(\mathbf{x} \in R^{(\infty)}; (x_1, \ldots, x_n) \in S_j^*) = \hat{P}_n(S_j^*).$$

Since $\hat{P}_n$ is a probability on $\mathcal{B}_n$, it is well behaved under monotone limits
(1.27). Hence $\hat{P}_n(S_j^*) \uparrow \hat{P}_n(S^*)$, and Theorem 2.18 is in force.


The extension requirements become particularly simple when the 
required process takes only values in a countable set. 


Corollary 2.20. Let $F \subset R^{(1)}$ be a countable set. If $p(s_1, \ldots, s_n)$ is specified
for all finite sequences of elements of $F$ and satisfies

$$p(\cdot) \ge 0,$$

$$\sum_{s_{n+1} \in F} p(s_1, \ldots, s_{n+1}) = p(s_1, \ldots, s_n),$$

$$\sum_{s_1 \in F} p(s_1) = 1,$$

then there exists a process $X_1, X_2, \ldots$ such that

$$P(X_1 = s_1, \ldots, X_n = s_n) = p(s_1, \ldots, s_n).$$

Proof. Let $\mathbf{s}$ denote an $n$-tuple with coordinates in $F$. For any $B \in \mathcal{B}_n$, define

$$\hat{P}_n(B) = \sum_{\mathbf{s} \in B} p(\mathbf{s}).$$

It is easy to check that $\hat{P}_n$ is a finite probability on $\mathcal{B}_n$. Furthermore, $B_k \uparrow B$
implies

$$\lim_k \hat{P}_n(B_k) = \hat{P}_n(B).$$

Thus we conclude (see 1.27) that $\hat{P}_n$ is a probability on $\mathcal{B}_n$. The $\hat{P}_n$ are clearly
consistent. Now apply 2.19 to get the result.
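
The hypotheses of Corollary 2.20 are easy to test mechanically for a concrete specification. The sketch below is an illustration, not part of the proof: the checker `is_consistent`, the two-state Markov specification `markov_p`, and the deliberately defective `broken_p` are hypothetical, and only sequences up to a chosen finite length are examined.

```python
from fractions import Fraction
from itertools import product

def is_consistent(p, F, max_n):
    """Check the three conditions of Corollary 2.20 for sequences up to length max_n."""
    if sum(p((s,)) for s in F) != 1:
        return False
    for n in range(1, max_n + 1):
        for seq in product(F, repeat=n):
            if p(seq) < 0:
                return False
            # Summing out the last coordinate must return the shorter sequence's value.
            if sum(p(seq + (s,)) for s in F) != p(seq):
                return False
    return True

# Hypothetical specification: a two-state Markov chain on F = {0, 1}.
F = (0, 1)
initial = {0: Fraction(1, 2), 1: Fraction(1, 2)}
transition = {
    (0, 0): Fraction(3, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 3), (1, 1): Fraction(2, 3),
}

def markov_p(seq):
    prob = initial[seq[0]]
    for previous, current in zip(seq, seq[1:]):
        prob *= transition[(previous, current)]
    return prob

def broken_p(seq):
    """A defective assignment: the one-step values sum to 2/3, not 1."""
    return Fraction(1, 3 ** len(seq))

print(is_consistent(markov_p, F, 4))
print(is_consistent(broken_p, F, 4))
```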

The extension results in this section are a bit disquieting, because even
though the results are purely measure-theoretic, the proofs in the space
$(R^{(\infty)}, \mathcal{B}_\infty)$ depend essentially on the topological properties of Euclidean
spaces. This is in the nature of the problem. For example, if one has an
infinite product $(\Omega^{(\infty)}, \mathcal{F}_\infty)$ of $(\Omega, \mathcal{F})$ spaces, that is:

$\Omega^{(\infty)} = \{(\omega_1, \omega_2, \ldots); \omega_k \in \Omega\}$;
$\mathcal{F}_\infty$, the smallest $\sigma$-field containing all sets of the form
$\{(\omega_1, \omega_2, \ldots); \omega_1 \in A_1, \ldots, \omega_n \in A_n\}$, $A_1, \ldots, A_n \in \mathcal{F}$;

$P_n$ a probability on all $n$-dimensional cylinder sets; and the set $\{P_n\}$ consistent;
then a probability $P$ on $\mathcal{F}_\infty$ agreeing with $P_n$ on $n$-dimensional
cylinder sets may or may not exist. For a counter-example, see Jessen and
Andersen [77] or Neveu [113, p. 84].


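Problem 5. Take $\Omega = (0, 1)$,

$$X_n(x) = \begin{cases} 1, & 0 < x < 1/n, \\ 0, & 1/n \le x < 1. \end{cases}$$

Let $\mathcal{F}_0 = \bigcup_{n \ge 1} \mathcal{F}(X_1, \ldots, X_n)$. Characterize the sets in $\mathcal{F}_0$, $\mathcal{F}(\mathcal{F}_0)$. For
$B \in \mathcal{B}_\infty$, assign

$$P(\mathbf{X} \in B) = \begin{cases} 1, & \text{if } (1, 1, 1, \ldots) \in B, \\ 0, & \text{otherwise.} \end{cases}$$

Prove that $P$ is additive on $\mathcal{F}_0$, but that there is no extension of $P$ to a
probability on $\mathcal{F}(\mathcal{F}_0)$.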


5. DISTRIBUTION FUNCTIONS 
What is needed to ensure that two processes have the same distribution? 


Definition 2.21. Given a process $\{X_n\}$ on $(\Omega, \mathcal{F}, P)$, define the $n$-dimensional
distribution functions by

$$F_n(x_1, \ldots, x_n) = P(X_1 < x_1, \ldots, X_n < x_n).$$

The functions $F_n(\cdot)$ are real-valued functions defined on $R^{(n)}$. Denote these
at times by $F_n(\mathbf{x}_n)$ or, to make their dependence on the random variables
explicit, by $F_{X_1 \cdots X_n}(x_1, \ldots, x_n)$ or $F_{\mathbf{X}_n}(\mathbf{x}_n)$.


Theorem 2.22. Two processes have the same distribution iff all their distribution 
functions are equal. 


The proof of 2.22 follows from a more general result that we want on the 
record. 


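Proposition 2.23. Let $Q$, $Q'$ be two probabilities on $(\Omega, \mathcal{F})$. Let $\mathcal{C}$ be a class
of sets such that $A, B \in \mathcal{C} \Rightarrow A \cap B \in \mathcal{C}$, and $\mathcal{F} = \mathcal{F}(\mathcal{C})$. Then $Q = Q'$ on
$\mathcal{C}$ implies that $Q = Q'$ on $\mathcal{F}$.


There seems to be a common belief that 2.23 is true without the hypothesis
that $\mathcal{C}$ be closed under $\cap$. To disprove this, let $\Omega = \{a, b, c, d\}$,
$Q_1(a) = Q_1(d) = Q_2(b) = Q_2(c) = \tfrac{1}{6}$, and $Q_1(b) = Q_1(c) = Q_2(a) = Q_2(d) = \tfrac{1}{3}$. $\mathcal{F}$ is
the class of all subsets of $\Omega$, and

$$\mathcal{C} = \{a \cup b, \; d \cup c, \; a \cup c, \; b \cup d\}.$$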
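
The counterexample can be confirmed by direct computation. The sketch below takes the two weights to be $\tfrac{1}{6}$ and $\tfrac{1}{3}$, arranged so that the two probabilities swap them; the names `Q1`, `Q2`, `measure`, and `C` are hypothetical encodings of the objects in the text.

```python
from fractions import Fraction

# Two probabilities on {a, b, c, d} that exchange the weights 1/6 and 1/3.
Q1 = {"a": Fraction(1, 6), "b": Fraction(1, 3), "c": Fraction(1, 3), "d": Fraction(1, 6)}
Q2 = {"a": Fraction(1, 3), "b": Fraction(1, 6), "c": Fraction(1, 6), "d": Fraction(1, 3)}

def measure(Q, A):
    """Q(A) for a subset A of the four-point space."""
    return sum((Q[w] for w in A), Fraction(0))

# The class C of the text: a∪b, d∪c, a∪c, b∪d.
C = [frozenset("ab"), frozenset("dc"), frozenset("ac"), frozenset("bd")]

agree_on_C = all(measure(Q1, A) == measure(Q2, A) for A in C)

# C is not closed under intersection: (a∪b) ∩ (a∪c) = {a} is not in C,
# and the two probabilities disagree on it.
intersection = C[0] & C[2]
print(agree_on_C, intersection in C, measure(Q1, intersection), measure(Q2, intersection))
```

Every set in `C` receives probability $\tfrac{1}{2}$ under both assignments, so agreement on the generating class alone does not force agreement on the generated $\sigma$-field.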


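Proof. Let $\mathcal{F}_0(\mathcal{C})$ be the smallest field containing $\mathcal{C}$. By the unique extension
theorem it suffices to show that $Q = Q'$ on $\mathcal{F}_0(\mathcal{C})$. Let $\mathcal{D}$ be the smallest
class of sets such that

i) $\mathcal{C} \subset \mathcal{D}$,
ii) $\Omega \in \mathcal{D}$,
iii) $A, B \in \mathcal{D}$, $B \subset A \Rightarrow A - B \in \mathcal{D}$.

Then $\mathcal{D} = \mathcal{F}_0(\mathcal{C})$. To see this, let $\mathcal{U}$ be the class of sets $A$ in $\mathcal{D}$ such that
$A \cap C \in \mathcal{D}$ for all $C \in \mathcal{C}$. Then notice that $\mathcal{U}$ satisfies (i), (ii), (iii) above, so
$\mathcal{U} = \mathcal{D}$. This implies that $A \cap C \in \mathcal{D}$ for all $A \in \mathcal{D}$, $C \in \mathcal{C}$. Now let
$\mathcal{E}$ be the class of sets $E$ in $\mathcal{D}$ such that $A \cap E \in \mathcal{D}$, all $A \in \mathcal{D}$. Similarly $\mathcal{E}$
satisfies (i), (ii), (iii), so $\mathcal{E} = \mathcal{D}$. This yields $\mathcal{D}$ closed under $\cap$, but by (ii),
(iii), $\mathcal{D}$ is also closed under complementation, proving the assertion. Let
$\mathcal{G}$ be the class of sets $G$ in $\mathcal{F}$ such that $Q(G) = Q'(G)$. Then $\mathcal{G}$ satisfies (i),
(ii), (iii) $\Rightarrow \mathcal{D} \subset \mathcal{G}$ or $\mathcal{F}_0(\mathcal{C}) \subset \mathcal{G}$.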


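Returning to the proof of 2.22. Let $\hat{P}$, $\hat{P}'$ be defined on $\mathcal{B}_\infty$ by $P(\mathbf{X} \in B)$,
$P'(\mathbf{X}' \in B)$, respectively. Let $\mathcal{C} \subset \mathcal{B}_\infty$ be the class of all sets of the form
$C = \{\mathbf{x}; x_1 < y_1, \ldots, x_n < y_n\}$. Then clearly $\mathcal{C}$ is closed under $\cap$, and
$\mathcal{F}(\mathcal{C}) = \mathcal{B}_\infty$. Now $\hat{P}(C) = F_{\mathbf{X}_n}(y_1, \ldots, y_n)$ and $\hat{P}'(C) = F_{\mathbf{X}_n'}(y_1, \ldots, y_n)$,
so that $\hat{P} = \hat{P}'$ on $\mathcal{C}$ by hypothesis. By 2.23 $\hat{P} = \hat{P}'$ on $\mathcal{B}_\infty$.

Another proof of 2.22 which makes it more transparent is as follows:
For any function $G(x_1, \ldots, x_n)$ on $R^{(n)}$ and $I$ an interval $[a, b)$, $a < b$,
$\mathbf{x} = (x_1, \ldots, x_n)$, write

$$\Delta^{k}_{I} G(\mathbf{x}) = G(x_1, \ldots, x_{k-1}, b, x_{k+1}, \ldots, x_n) - G(x_1, \ldots, x_{k-1}, a, x_{k+1}, \ldots, x_n).$$

By definition, since

$$F_n(x_1, \ldots, x_n) = P(X_1 < x_1, \ldots, X_n < x_n),$$

the probability of any rectangle $\{X_1 \in I_1, \ldots, X_n \in I_n\}$ with $I_1, \ldots, I_n$ left
closed, right open, can be expressed in terms of $F_n$ by

(2.24) $P(X_1 \in I_1, \ldots, X_n \in I_n) = \Delta^{1}_{I_1} \cdots \Delta^{n}_{I_n} F_n(\mathbf{x})$

because, for $I_n = [a_n, b_n)$,

$$\Delta^{n}_{I_n} F_n(\mathbf{x}) = P(X_1 < x_1, \ldots, X_{n-1} < x_{n-1}, X_n < b_n) - P(X_1 < x_1, \ldots, X_{n-1} < x_{n-1}, X_n < a_n)$$
$$= P(X_1 < x_1, \ldots, X_{n-1} < x_{n-1}, X_n \in I_n).$$

By taking limits, we can now get the probabilities of all rectangles. From
the extension theorem 2.18 we know that specifying $\hat{P}$ on rectangles uniquely
determines it.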
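
Expanding the iterated differences in (2.24) gives a signed sum of the distribution function over the corners of the rectangle, which is straightforward to compute. The sketch below is an illustration under an assumed example: `rectangle_probability` is a hypothetical helper, and `F2` is the distribution function of two independent variables uniform on $[0, 1)$, chosen only because the answer is easy to verify by hand.

```python
from fractions import Fraction
from itertools import product

def rectangle_probability(F, intervals):
    """Apply one difference operator per coordinate to F.

    intervals is a list of pairs (a_k, b_k) standing for [a_k, b_k).
    Each corner of the rectangle enters with sign (-1)**(number of lower endpoints).
    """
    n = len(intervals)
    total = 0
    for choice in product((0, 1), repeat=n):      # 0 picks a_k, 1 picks b_k
        corner = [interval[c] for interval, c in zip(intervals, choice)]
        sign = (-1) ** (n - sum(choice))
        total += sign * F(*corner)
    return total

def uniform_cdf(x):
    """P(U < x) for U uniform on [0, 1)."""
    return min(max(x, Fraction(0)), Fraction(1))

def F2(x, y):
    """Assumed example: joint distribution function of two independent uniforms."""
    return uniform_cdf(x) * uniform_cdf(y)

rect = [(Fraction(1, 4), Fraction(1, 2)), (Fraction(0), Fraction(1, 3))]
print(rectangle_probability(F2, rect))   # area of the rectangle, by independence
```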

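Frequently, the distribution of a process is specified by giving a set of
distribution functions $\{F_n(\mathbf{x})\}$, $n = 1, 2, \ldots$ But in order that $\{F_n(\mathbf{x})\}$ be
derived from a process $\{X_n\}$ on a probability space $(\Omega, \mathcal{F}, P)$, they must have
certain essential properties.


Proposition 2.25. The distribution functions $F_n(\mathbf{x})$ satisfy the conditions:

i) Non-negativity. For finite intervals $I_k = [a_k, b_k)$, $k = 1, \ldots, n$,

$$\Delta^{1}_{I_1} \cdots \Delta^{n}_{I_n} F_n(\mathbf{x}) \ge 0.$$

ii) Continuity from below. If $\mathbf{x}^{(k)} = (x_1^{(k)}, \ldots, x_n^{(k)})$ and $x_j^{(k)} \uparrow x_j$, $j = 1, \ldots, n$,
then

$$F_n(\mathbf{x}^{(k)}) \uparrow F_n(\mathbf{x}).$$

iii) Normalization. All limits of $F_n$ exist as $x_j \uparrow +\infty$ or $x_j \downarrow -\infty$.
If $x_j \downarrow -\infty$, then $F_n(\mathbf{x}) \to 0$. If all $x_j$, $j = 1, \ldots, n$, $\uparrow +\infty$, then $F_n(\mathbf{x}) \to 1$.

The set of distribution functions are connected by

iv) Consistency.

$$\lim_{x_n \uparrow +\infty} F_n(x_1, \ldots, x_n) = F_{n-1}(x_1, \ldots, x_{n-1}).$$

Proof. The proof of (i) follows from (2.24). To prove (ii), note that

$$\{\omega; X_1 < x_1^{(k)}, \ldots, X_n < x_n^{(k)}\} \uparrow \{\omega; X_1 < x_1, \ldots, X_n < x_n\}.$$

Use the essential fact that probabilities behave nicely under monotone limits
to get (ii). Use this same fact to prove (iii) and (iv); e.g. if $x_j \downarrow -\infty$,
then $\{\omega; X_1 < x_1, \ldots, X_n < x_n\} \downarrow \emptyset$. If all $x_1, \ldots, x_n \uparrow +\infty$, then

$$\{\omega; X_1 < x_1, \ldots, X_n < x_n\} \uparrow \Omega.$$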

Another important construction theorem verifies that the conditions of
2.25 characterize the distribution functions of a process.


Theorem 2.26. Given a set of functions $\{F_n(\mathbf{x})\}$ satisfying 2.25 (i), (ii), (iii), (iv),
there is a process $\{X_n\}$ on $(\Omega, \mathcal{F}, P)$ such that

$$P(X_1 < x_1, \ldots, X_n < x_n) = F_n(x_1, \ldots, x_n).$$

Proof. The idea of how the proof should go is simple. Use $\Omega = R^{(\infty)}$,
$\mathcal{F} = \mathcal{B}_\infty$, and use the coordinate representation process $\hat{X}_1, \hat{X}_2, \ldots$ We
want to construct $\hat{P}$ on $\mathcal{B}_\infty$ such that if $S \in \mathcal{B}_\infty$ is a semi-infinite rectangle of
the form

$$\{\mathbf{x} \in R^{(\infty)}; x_1 < y_1, \ldots, x_n < y_n\},$$

then

$$\hat{P}(S) = F_n(y_1, \ldots, y_n).$$

To construct $\hat{P}$ starting from $F_n$, define $\hat{P}$ on rectangles whose sides are left
closed, right open, intervals $I_1, \ldots, I_n$ by

$$\hat{P}(\mathbf{x}; x_1 \in I_1, \ldots, x_n \in I_n) = \Delta^{1}_{I_1} \cdots \Delta^{n}_{I_n} F_n(\mathbf{x}).$$

Extend this to all rectangles by taking limits. The consistency 2.25 (iv)
guarantees that $\hat{P}$ is well defined on all rectangles. All that is necessary to
do now is to verify the conditions of 2.18. If $S_j$, $S$ are left closed, right open
rectangles, and $S_j \uparrow S$, then the continuity from below of $F_n$, 2.25 (ii), yields

$$\lim_j \hat{P}(S_j) = \hat{P}(S).$$

To verify the above for general rectangles, use the fact that their probabilities
can be defined as limits of probabilities of left closed, right open rectangles.
The complication is in showing additivity of $\hat{P}$ on rectangles. It is
sufficient to show that

$$\hat{P}(S) = \sum_j \hat{P}(S_j)$$

for left closed, right open, disjoint rectangles $S_1, \ldots, S_k$ whose union is a
rectangle $S$. In one dimension the statement $\hat{P}(S) = \sum_j \hat{P}(S_j)$ follows from
the obvious fact that for $a_1 < a_2 < \cdots < a_{k+1}$,

$$\Delta_{[a_1, a_{k+1})} F(x) = \sum_{j=1}^{k} \Delta_{[a_j, a_{j+1})} F(x).$$

The general result is a standard theorem in the theory of the Stieltjes integral
(McShane [111a, pp. 245-246]).

If a function $F(x_1, \ldots, x_n)$ satisfies only the first three conditions of 2.25
then Theorem 2.26 implies the following.



Corollary 2.27. There are random variables $X_1, \ldots, X_n$ on a space $(\Omega, \mathcal{F}, P)$
such that

$$P(X_1 < x_1, \ldots, X_n < x_n) = F(x_1, \ldots, x_n).$$

Hence, any such function will be called an $n$-dimensional distribution function.


If a set $\{F_n\}$, $n = 1, 2, \ldots$, of $n$-dimensional distribution functions
satisfies 2.25 (iv), call them consistent. The specification of a consistent set of
$\{F_n\}$ is pretty much the minimum amount of data needed to completely
specify the distribution of a process in the general case.


Problems 


6. For any random variable $X$, let $F_X(x) = P(X < x)$. The function $F_X(x)$
is called the distribution function of the variable $X$.

Prove that $F_X(x)$ satisfies

i) $x' < x \Rightarrow F_X(x') \le F_X(x)$,

ii) $\lim_{x \to +\infty} F_X(x) = 1$, $\lim_{x \to -\infty} F_X(x) = 0$,

iii) $\lim_{x_k \uparrow x} F_X(x_k) = F_X(x)$.


7. If a function $F(x_1, \ldots, x_n)$ is nondecreasing in each variable separately,
does this imply $\Delta^{1}_{I_1} \cdots \Delta^{n}_{I_n} F(\mathbf{x}) \ge 0$? Give an example of a function
$F(x, y)$ such that

i) $F(x, y) \ge 0$,

ii) $x' \ge x$, $y' \ge y \Rightarrow F(x', y) \ge F(x, y)$, $F(x, y') \ge F(x, y)$.

iii) There are finite intervals $I_1$, $I_2$ such that

$$\Delta^{1}_{I_1} \Delta^{2}_{I_2} F(x, y) < 0.$$


8. Let $F_1(x), F_2(x), \ldots$ be functions satisfying the conditions of Problem 6.
Prove that the functions

$$F_n(x_1, \ldots, x_n) = \prod_{k=1}^{n} F_k(x_k)$$

form a consistent set of distribution functions.


6. RANDOM VARIABLES 


From now on, for reasons sufficient and necessary, we study random variables 
defined on a probability space. The sufficient reason is that the extension 
theorems state that given a fairly reasonable assignment of probabilities, a 
process can be constructed fitting the specified data. The necessity is that 
most strong limit theorems require this kind of an environment. Now we 
record a few facts regarding random variables and probability spaces. 
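
Proposition 2.28. Let $\mathcal{C}$ be a class of Borel sets such that $\mathcal{F}(\mathcal{C}) = \mathcal{B}_1$, $X$ a
real-valued function on $\Omega$. If $\{X \in C\} \in \mathcal{F}$, all $C \in \mathcal{C}$, then $X$ is a random
variable on $(\Omega, \mathcal{F})$.


Proof. Let $\mathcal{D} \subset \mathcal{B}_1$ be the class of all Borel sets $D$ such that $\{X \in D\} \in \mathcal{F}$.
$\mathcal{D}$ is a $\sigma$-field. $\mathcal{C} \subset \mathcal{D} \Rightarrow \mathcal{D} = \mathcal{B}_1$.


Corollary 2.29. If $\{X \in I\} \in \mathcal{F}$ for all intervals $I$, then $X$ is a random variable.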


At times functions come up which may be infinite on some parts of Q 
but which are random variables on subsets where they are finite. 


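Definition 2.30. An extended random variable $X$ on $(\Omega, \mathcal{F})$ may assume the
values $\pm\infty$, but $\{X \in B\} \in \mathcal{F}$, for all $B \in \mathcal{B}_1$.


Proposition 2.31. Let $\mathbf{X}$ be a random vector to $(R, \mathcal{B})$. If $\varphi(x)$ is a random
variable on $(R, \mathcal{B})$, then $\varphi(\mathbf{X})$ is a random variable on $(\Omega, \mathcal{F})$, measurable $\mathcal{F}(\mathbf{X})$.


Proof. Write, for $A \in \mathcal{B}_1$,

$$\{\varphi(\mathbf{X}) \in A\} = \{\mathbf{X} \in \varphi^{-1}(A)\} \in \mathcal{F}(\mathbf{X}),$$

$\varphi^{-1}(A)$ here denoting the inverse image of $A$ under $\varphi$.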


Definition 2.32. For random variables $X_1, X_2, \ldots$ on $\Omega$, the $\sigma$-field of all events
depending on the first $n$ outcomes is the class of sets $\{(X_1, \ldots, X_n) \in B\}$,
$B \in \mathcal{B}_n$. Denote this by $\mathcal{F}(X_1, \ldots, X_n)$. The class of sets depending on only a
finite number of outcomes is

$$\mathcal{F}_0 = \bigcup_{n \ge 1} \mathcal{F}(X_1, \ldots, X_n).$$

In general, $\mathcal{F}_0$ is a field, but not a $\sigma$-field. But the fact that $\mathcal{F}(\mathbf{X}) = \mathcal{F}(\mathcal{F}_0)$
follows immediately from the definitions.


Proposition 2.33. Given a process $X_1, X_2, \ldots$. For every set $A_1 \in \mathcal{F}(\mathbf{X})$ and
$\epsilon > 0$, there is a set $A_2$ in some $\mathcal{F}(X_1, \ldots, X_n)$ such that

$$P(A_1 \,\Delta\, A_2) \le \epsilon.$$

($A_1 \,\Delta\, A_2$ is the symmetric set difference $(A_1 - A_2) \cup (A_2 - A_1)$.)


Proof. The proof of this is one of the standard results which cluster around
the construction used in the Carathéodory extension theorem. The statement
is that if $P$ on $\mathcal{F}(\mathcal{F}_0)$ is an extension of $P$ on $\mathcal{F}_0$, then for every set $A_1 \in \mathcal{F}(\mathcal{F}_0)$
and $\epsilon > 0$, there is a set $A_2$ in the field $\mathcal{F}_0$ such that $P(A_1 \,\Delta\, A_2) \le \epsilon$ (see
Appendix A.12). Then 2.33 follows because $\mathcal{F}(\mathbf{X})$ is the smallest $\sigma$-field
containing

$$\bigcup_{n \ge 1} \mathcal{F}(X_1, \ldots, X_n).$$

If all the random variables in a process $X_1, X_2, \ldots$ take values in a Borel
set $E \in \mathcal{B}_1$, it may be more convenient to use the range space $(E^{(\infty)}, \mathcal{B}_\infty(E))$,
where $\mathcal{B}_\infty(E)$ consists of all sets in $\mathcal{B}_\infty$ which are subsets of $E^{(\infty)}$. For
example, if $X_1, X_2, \ldots$ are coin-tossing variables, then each one takes values
in $\{0, 1\}$, and the relevant $R$, $\mathcal{B}$ for the process is

$$(\{0, 1\}^{(\infty)}, \mathcal{B}_\infty(\{0, 1\})).$$

If a random variable $X$ has distribution function $F(x)$, then $P(X \in B)$
is a probability measure on $\mathcal{B}_1$ which is an extension of the measure on
intervals $[a, b)$ given by $F(b) - F(a)$. Thus, use the notation:


Definition 2.34. For $X$ a random variable, denote by $P(X \in dx)$ or $F(dx)$ the
probability measure $P(X \in B)$ on $\mathcal{B}_1$. Refer to $F(dx)$ as the distribution of $X$.


Definition 2.35. A sequence $X_1, X_2, \ldots$ of random variables all having the
same distribution $F(dx)$ are called identically distributed. Similarly, call
random vectors $\mathbf{X}_1, \mathbf{X}_2, \ldots$ with the same range space $(R, \mathcal{B})$ identically
distributed if they have the common distribution

$$\hat{P}(d\mathbf{x}) = P(\mathbf{X}_n \in d\mathbf{x}).$$


Problems 
9. Show that $\mathcal{B}_\infty(\{0, 1\})$ is the smallest $\sigma$-field containing all sets of the form

$$\{\mathbf{x} \in \{0, 1\}^{(\infty)}; x_1 = s_1, \ldots, x_n = s_n\},$$

where $s_1, \ldots, s_n$ is any sequence $n$-long of zeros and ones, $n = 1, 2, \ldots$


10. Given a process $X_1, X_2, \ldots$ on $(\Omega, \mathcal{F}, P)$. Let $m_1, m_2, \ldots$ be positive
integer-valued random variables on $(\Omega, \mathcal{F}, P)$. Prove that the sequence
$X_{m_1}, X_{m_2}, \ldots$ is a process on $(\Omega, \mathcal{F}, P)$.


7. EXPECTATIONS OF RANDOM VARIABLES 


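Definition 2.36. Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$. Define the expectation
of $X$, denoted $EX$, by $\int X(\omega) \, dP(\omega)$. This is well defined if $E|X| < \infty$.
Alternative notations for the integrals are

$$\int X \, dP, \qquad \int X(\omega) P(d\omega).$$

Definition 2.37. For any probability space $(\Omega, \mathcal{F}, P)$ define

i) if $A \in \mathcal{F}$, the set indicator $\chi_A(\omega)$ is the random variable

$$\chi_A(\omega) = \begin{cases} 1, & \omega \in A, \\ 0, & \omega \in A^c. \end{cases}$$

ii) If $X$ is a random variable, then $X^+$, $X^-$ are the random variables

$$X^+(\omega) = \begin{cases} X(\omega), & \omega \in \{X \ge 0\}, \\ 0, & \text{otherwise;} \end{cases}$$

$$X^-(\omega) = \begin{cases} -X(\omega), & \omega \in \{X \le 0\}, \\ 0, & \text{otherwise.} \end{cases}$$

A number of results we prove in this and later sections depend on a
principle we state as

Proposition 2.38. Consider a class $\mathcal{L}$ of random variables having the properties

i) $X, Y \in \mathcal{L}$, $\alpha, \beta \ge 0 \Rightarrow \alpha X + \beta Y \in \mathcal{L}$.
ii) $X_n \in \mathcal{L}$, $X_n(\omega) \uparrow X(\omega)$ for every $\omega \Rightarrow X \in \mathcal{L}$.
iii) For every set $A \in \mathcal{F}$, $\chi_A(\omega) \in \mathcal{L}$.

Then $\mathcal{L}$ includes all nonnegative random variables on $(\Omega, \mathcal{F}, P)$.

Proof. See Appendix A.22.

This is used to prove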


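Proposition 2.39. Let the processes $\mathbf{X}$ on $(\Omega, \mathcal{F}, P)$, $\mathbf{X}'$ on $(\Omega', \mathcal{F}', P')$ have the
same distribution. Then if $\varphi(\mathbf{x})$ is measurable $(R^{(\infty)}, \mathcal{B}_\infty)$,

(2.40) $\int_{\Omega} \varphi(\mathbf{X}(\omega)) P(d\omega) = \int_{\Omega'} \varphi(\mathbf{X}'(\omega')) P'(d\omega')$,

in the sense that if either side is well defined, so is the other, and the two are
equal.


Proof. Consider all $\varphi$ for which (2.40) is true. This class satisfies (i) and
(ii) of 2.38. Further, let $B \in \mathcal{B}_\infty$ and let $\varphi(\mathbf{x}) = \chi_B(\mathbf{x})$. Then the two sides
of (2.40) become $P(\mathbf{X} \in B)$ and $P'(\mathbf{X}' \in B)$, respectively. But these are equal
since the processes have the same distribution. Hence (2.40) holds for all
nonnegative $\varphi$. Thus, for any $\varphi$, it holds true for $|\varphi|$, $\varphi^+$, $\varphi^-$.


Corollary 2.41. Define $\hat{P}(\cdot)$ on $\mathcal{B}_\infty$ by $\hat{P}(B) = P(\mathbf{X} \in B)$. Then if $\varphi$ is measurable
$(R^{(\infty)}, \mathcal{B}_\infty)$ and $E|\varphi(\mathbf{X})| < \infty$,

$$\int \varphi(\mathbf{X}) \, dP = \int \varphi(\mathbf{x}) \hat{P}(d\mathbf{x}).$$

Proof. $\{X_n\}$ on $(\Omega, \mathcal{F}, P)$ and $\{\hat{X}_n\}$ on $(R^{(\infty)}, \mathcal{B}_\infty, \hat{P})$ have the same
distribution, where $\hat{X}_n$ is the coordinate representation process. Thus

$$\int \varphi(X_1, X_2, \ldots) \, dP = \int \varphi(\hat{X}_1, \hat{X}_2, \ldots) \hat{P}(d\mathbf{x}),$$

but $\hat{X}_n(\mathbf{x}) = x_n$.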
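
Corollary 2.41 says an expectation can be computed either on the sample space or against the distribution. The sketch below checks this in the simplest possible setting, which is an assumption made for illustration: a single random variable `X` on a hypothetical six-point sample space (a fair die) rather than a whole process, with `phi` an arbitrary test function.

```python
from fractions import Fraction

# Hypothetical finite probability space: a fair die.
P = {w: Fraction(1, 6) for w in range(1, 7)}

def X(w):
    """A random variable on the die: the face reduced modulo 3."""
    return w % 3

def phi(x):
    """An arbitrary measurable test function."""
    return x * x

# Left side: integrate phi(X(w)) over the sample space.
lhs = sum(phi(X(w)) * P[w] for w in P)

# Right side: build the distribution of X, then integrate phi against it.
P_hat = {}
for w, mass in P.items():
    P_hat[X(w)] = P_hat.get(X(w), Fraction(0)) + mass
rhs = sum(phi(x) * mass for x, mass in P_hat.items())

print(lhs, rhs)
```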


8. CONVERGENCE OF RANDOM VARIABLES


Given a sequence of random variables $\{X_n\}$, there are various modes of
strong convergence of $X_n$ to a limiting random variable $X$.

Definition 2.42

i) $X_n$ converges to $X$ almost surely (a.s.) if

$$P(\omega; \lim_n X_n(\omega) = X(\omega)) = 1.$$

Denote this by $X_n \xrightarrow{\text{a.s.}} X$ or $X_n \to X$ a.s.

ii) $X_n$ converges to $X$ in $r$th mean, for $r > 0$, if $E|X_n - X|^r \to 0$. Denote
this by $X_n \xrightarrow{r} X$.

iii) $X_n$ converges in probability to $X$ if for every $\epsilon > 0$,

$$P(|X_n - X| > \epsilon) \to 0 \text{ as } n \to \infty.$$

Denote this by $X_n \xrightarrow{P} X$.

The important things to notice are:

First, all these convergences are "probabilistic." That is, if $X, X_1, \ldots$ has
the same distribution as $X', X_1', \ldots$, then $X_n \to X$ in any of the above senses
implies that $X_n' \to X'$ in the same sense. This is obvious for $\xrightarrow{r}$, $\xrightarrow{P}$.
See Problem 12 for $\xrightarrow{\text{a.s.}}$.


Secondly, Cauchy convergence in any one of these senses gives convergence.


Proposition 2.43. If $X_m - X_n \to 0$ in any of the above ways as $m, n \to \infty$
in any way, then there is a random variable $X$ such that $X_n \to X$ in the same
way.

Proof. Do a.s. convergence first. For all $\omega$ such that $X_n(\omega)$ is Cauchy
convergent, $\lim_n X_n(\omega)$ exists. Hence $P(\omega; \lim_n X_n(\omega) \text{ exists}) = 1$. Let
$X(\omega) = \lim_n X_n(\omega)$ for all $\omega$ such that the limit exists, otherwise put it equal
to zero, then $X_n \xrightarrow{\text{a.s.}} X$. For the other modes of Cauchy convergence the
proof is deferred until Section 3 of the next chapter (Problems 6 and 7).


Thirdly, of these various kinds of convergences $\xrightarrow{\text{a.s.}}$ is usually the hardest
to establish and more or less the strongest. To get from a.s. convergence to
$\xrightarrow{r}$, some sort of boundedness condition is necessary. Recall

Theorem 2.44. (Lebesgue bounded convergence theorem). If $Y_n \xrightarrow{\text{a.s.}} Y$ and
if there is a random variable $Z \ge 0$ such that $EZ < \infty$, and $|Y_n| \le Z$ for all $n$,
then $EY_n \to EY$. (See Appendix A.28).

Hence, using $Y_n = |X_n - X|^r$ in 2.44, we get

(2.45) $X_n \xrightarrow{\text{a.s.}} X$, $|X_n|^r \le Z$, $EZ < \infty \Rightarrow X_n \xrightarrow{r} X$.

Convergence in probability is the weakest. The implications go

(2.46)
i) $\xrightarrow{\text{a.s.}} \; \Rightarrow \; \xrightarrow{P}$,
ii) $\xrightarrow{r} \; \Rightarrow \; \xrightarrow{P}$.
Problems 


11. Prove (2.46i and ii). [Use a generalization of Chebyshev's inequality
on (ii).]

12. Let $\{X_n\}$, $\{X_n'\}$ have the same distribution. Prove that if $X_n \to X$ a.s.,
there is a random variable $X'$ such that $X_n' \to X'$ a.s.


13. For a process $\{X_n\}$ prove that the set

$$\{\omega; \lim_n X_n(\omega) \text{ does not exist}\}$$

is an event (i.e., is in $\mathcal{F}$).


14. Prove that for $X$ a random variable with $E|X| < \infty$, $A_n \in \mathcal{F}$, $A_n \downarrow \emptyset$
implies

$$\lim_n \int_{A_n} X \, dP = 0.$$


NOTES 


The use of a probability space $(\Omega, \mathcal{F}, P)$ as a context for probability theory
was formalized by Kolmogorov [98], in a monograph published in 1933.
But, as Kolmogorov pointed out, the concept had already been current for 
some time. 

Subsequent work in probability theory has proceeded, almost without 
exception, from this framework. There has been controversy about the 
correspondence between the axioms for a probability space and more 
primitive intuitive notions of probability. A different approach in which the 
probability of an event is defined as its asymptotic frequency is given by 
von Mises [112]. The argument can go on at several levels. At the top is 
the contention that although it seems reasonable to assume that P is a 
finite probability, there are no strong intuitive grounds for assuming it 
$\sigma$-additive. Thus, in their recent book [40] Dubins and Savage assume only
finite additivity, and even within this weaker framework prove interesting
limit theorems. One level down is the question of whether a probability
measure P need be additive at all. The more basic property is argued to be:
$A \subset B \Rightarrow P(A) \le P(B)$. But, as always, with weaker assumptions fewer
nontrivial theorems can be proven.

At the other end, it happens that some $\sigma$-fields have so many sets in them
that examples occur which disturb one's intuitive concept of probability.
Thus, there has been some work in the direction of restricting the type of
$\sigma$-field to be considered. An interesting article on this is by Blackwell [8].

For a more detailed treatment of measure theory than is possible in
Appendix A, we recommend the books of Neveu [113], Loève [108], and
Halmos [64].


CHAPTER 3 


INDEPENDENCE 


Independence, or some form of it, is one of the central concepts of prob- 
ability, and it is largely responsible for the distinctive character of probability 
theory. 


1. BASIC DEFINITIONS AND RESULTS 
Definition 3.1 
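(a) Given random variables $X_1, \ldots, X_n$ on $(\Omega, \mathcal{F}, P)$, they are said to be
independent if for any sets $B_1, \ldots, B_n \in \mathcal{B}_1$,
$$P(X_1 \in B_1, \ldots, X_n \in B_n) = \prod_{k=1}^{n} P(X_k \in B_k).$$

(b) Given a probability space $(\Omega, \mathcal{F}, P)$ and $\sigma$-fields $\mathcal{F}_1, \ldots, \mathcal{F}_n$ contained in
$\mathcal{F}$, they are said to be independent if for any sets $A_1 \in \mathcal{F}_1, \ldots, A_n \in \mathcal{F}_n$,
$$P(A_1 \cap \cdots \cap A_n) = \prod_{k=1}^{n} P(A_k).$$

Obviously, $X_1, \ldots, X_n$ are independent random variables iff $\mathcal{F}(X_1), \ldots, \mathcal{F}(X_n)$
are independent $\sigma$-fields.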
These definitions have immediate generalizations to random vectors. 


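Definition 3.2. Random vectors $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ are said to be independent if
the $\sigma$-fields $\mathcal{F}(\mathbf{X}_1), \mathcal{F}(\mathbf{X}_2), \ldots, \mathcal{F}(\mathbf{X}_n)$ are independent.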


Virtually all the results of this section stated for independent random 
variables hold for independent random vectors. But as the generalization is 
so apparent, we usually omit it. 


Definition 3.3. The random variables $X_1, X_2, \ldots$ are called independent if for
every $n \ge 2$, the random variables $X_1, X_2, \ldots, X_n$ are independent.


Proposition 3.4. Let $X_1, X_2, \ldots$ be independent random variables and $B_1, B_2, \ldots$
any sets in $\mathcal{B}_1$. Then
$$P(X_1 \in B_1, X_2 \in B_2, \ldots) = \prod_{k=1}^{\infty} P(X_k \in B_k).$$

Proof. Let $A_n = \{X_1 \in B_1, \ldots, X_n \in B_n\}$. Then the $A_n$ are decreasing, hence
$$P(\lim A_n) = \lim P(A_n).$$
But,
$$\lim A_n = \{X_1 \in B_1, X_2 \in B_2, \ldots\}$$
and
$$\lim_n P(A_n) = \lim_n \prod_{k=1}^{n} P(X_k \in B_k) = \prod_{k=1}^{\infty} P(X_k \in B_k).$$

Note that the same result holds for independent $\sigma$-fields.


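Proposition 3.5. Let $X_1, X_2, \ldots$ be independent random variables, $(i_1, i_2, \ldots)$,
$(j_1, j_2, \ldots)$ disjoint sets of integers. Then the fields
$$\mathcal{F}_1 = \mathcal{F}(X_{i_1}, X_{i_2}, \ldots), \quad \mathcal{F}_2 = \mathcal{F}(X_{j_1}, X_{j_2}, \ldots)$$
are independent.


Proof. Consider any set $D \in \mathcal{F}_2$ of the form $D = \{X_{j_1} \in B_1, \ldots, X_{j_m} \in B_m\}$,
$B_k \in \mathcal{B}_1$, $k = 1, \ldots, m$. Define two measures $Q_1$ and $Q_1'$ on $\mathcal{F}_1$ by, for $A \in \mathcal{F}_1$,
$$Q_1(A) = P(A \cap D), \quad Q_1'(A) = P(A)P(D).$$
Consider the class of sets $\mathcal{C} \subset \mathcal{F}_1$ of the form
$$C = \{X_{i_1} \in E_1, \ldots, X_{i_n} \in E_n\}, \quad E_l \in \mathcal{B}_1,\ l = 1, \ldots, n.$$
Note that
$$Q_1(C) = P\Big(\bigcap_{l=1}^{n} \{X_{i_l} \in E_l\} \cap \bigcap_{k=1}^{m} \{X_{j_k} \in B_k\}\Big)
= \prod_{l=1}^{n} P(X_{i_l} \in E_l) \prod_{k=1}^{m} P(X_{j_k} \in B_k)
= P(C)P(D) = Q_1'(C).$$

Thus $Q_1 = Q_1'$ on $\mathcal{C}$, $\mathcal{C}$ is closed under $\cap$, $\mathcal{F}(\mathcal{C}) = \mathcal{F}_1$ $\Rightarrow$ $Q_1 = Q_1'$ on $\mathcal{F}_1$
(see 2.23). Now repeat the argument. Fix $A \in \mathcal{F}_1$ and define $Q_2(\cdot)$, $Q_2'(\cdot)$
on $\mathcal{F}_2$ by $P(A \cap \cdot)$, $P(A)P(\cdot)$. By the preceding, for any $D$ of the form given
above, $Q_2(D) = Q_2'(D)$, implying $Q_2 = Q_2'$ on $\mathcal{F}_2$ and thus for any $A_1 \in \mathcal{F}_1$,
$A_2 \in \mathcal{F}_2$,
$$P(A_1 \cap A_2) = P(A_1)P(A_2).$$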


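Corollary 3.6. Let $X_1, X_2, \ldots$ be independent random variables, $J_1, J_2, \ldots$
disjoint sets of integers. Then the $\sigma$-fields $\mathcal{F}_k = \mathcal{F}(\{X_j\}, j \in J_k)$ are independent.


Proof. Assume that $\mathcal{F}_1, \ldots, \mathcal{F}_n$ are independent. Let
$$J = \bigcup_{k=1}^{n} J_k \quad \text{and} \quad \mathcal{F}' = \mathcal{F}(\{X_j\}, j \in J).$$
Since $J$ and $J_{n+1}$ are disjoint, $\mathcal{F}'$ and $\mathcal{F}_{n+1}$ satisfy the conditions of Proposition
3.5 above. Let $A_1 \in \mathcal{F}_1, \ldots, A_n \in \mathcal{F}_n$. Then since $\mathcal{F}_k \subset \mathcal{F}'$, $k = 1, \ldots, n$,
$$A_1 \cap A_2 \cap \cdots \cap A_n \in \mathcal{F}'.$$
Let $A' = A_1 \cap \cdots \cap A_n$ and $A_{n+1} \in \mathcal{F}_{n+1}$. By 3.5,
$$P(A_{n+1} \cap A') = P(A_{n+1})P(A') = P(A_{n+1})P(A_n) \cdots P(A_1)$$
by the induction hypothesis.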


From 3.5 and 3.6 we extract more concrete and interesting consequences.
For instance, 3.5 implies that the fields $\mathcal{F}(X_1, \ldots, X_n)$ and $\mathcal{F}(X_{n+1}, \ldots)$
are independent. As another example, if $\varphi_1, \varphi_2, \ldots$ are measurable $(R^{(m)}, \mathcal{B}_m)$,
then the random variables
$$Z_1 = \varphi_1(X_1, \ldots, X_m), \quad Z_2 = \varphi_2(X_{m+1}, \ldots, X_{2m}), \quad \ldots$$
are independent. Another way of stating 3.6 is to say that the random
vectors $\mathbf{X}_k = (\{X_j\}, j \in J_k)$ are independent.


How and when do we get independent random variables ? 


Theorem 3.7. A necessary and sufficient condition for $X_1, X_2, \ldots$ to be
independent random variables is that for every $n$, and $n$-tuple $(x_1, \ldots, x_n)$,
$$F_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = F_{X_1}(x_1) \cdots F_{X_n}(x_n).$$

Proof. It is obviously necessary: consider the sets $\{X_1 < x_1, \ldots, X_n < x_n\}$.
To go the other way, we want to show that for arbitrary $B_1, \ldots, B_n \in \mathcal{B}_1$,
$$P(X_1 \in B_1, \ldots, X_n \in B_n) = \prod_{k=1}^{n} P(X_k \in B_k).$$
Fix $x_2, \ldots, x_n$ and define two $\sigma$-additive measures $Q_1$ and $Q_1'$ on $\mathcal{B}_1$ by
$$Q_1(B) = P(X_1 \in B, X_2 < x_2, \ldots, X_n < x_n),$$
$$Q_1'(B) = P(X_1 \in B)P(X_2 < x_2, \ldots, X_n < x_n).$$
Now on all sets of the form $(-\infty, x)$, $Q_1$ and $Q_1'$ agree, implying that $Q_1 = Q_1'$
on $\mathcal{B}_1$. Repeat this by fixing $B_1 \in \mathcal{B}_1$, $x_3, \ldots, x_n$ and defining
$$Q_2(B) = P(X_1 \in B_1, X_2 \in B, X_3 < x_3, \ldots, X_n < x_n),$$
$$Q_2'(B) = P(X_1 \in B_1)P(X_2 \in B)P(X_3 < x_3, \ldots, X_n < x_n),$$
so $Q_2 = Q_2'$ on the sets $(-\infty, x)$, hence $Q_2 = Q_2'$ on $\mathcal{B}_1$, and continue on
down.

The implication of this theorem is that if we have any one-dimensional
distribution functions $F_1(x), F_2(x), \ldots$ and we form the consistent set
of distribution functions (see Problem 8, Chapter 2) $F_1(x_1) \cdots F_n(x_n)$, then
any resulting process $X_1, X_2, \ldots$ having these distribution functions consists
of independent random variables.


Proposition 3.8. Let $X$ and $Y$ be independent random variables, $f$ and $g$ $\mathcal{B}_1$-measurable
functions such that $E|f(X)| < \infty$, $E|g(Y)| < \infty$, then
$$E|f(X)g(Y)| < \infty$$
and
$$E[f(X)g(Y)] = Ef(X) \cdot Eg(Y).$$

Proof. For any set $A \in \mathcal{B}_1$, take $f(x) = \chi_A(x)$; and consider the class
$\mathcal{L}$ of nonnegative functions $g(y)$ for which the equality in 3.8 holds. $\mathcal{L}$ is
closed under linear combinations. By the Lebesgue monotone convergence
theorem applied to both sides of the equation in 3.8, $g_n \in \mathcal{L}$, $g_n \uparrow g$ $\Rightarrow$ $g \in \mathcal{L}$.
If $B \in \mathcal{B}_1$ and $g(y) = \chi_B(y)$, then the equation becomes
$$P(X \in A, Y \in B) = P(X \in A)P(Y \in B),$$
which holds by independence of $X$ and $Y$. By 2.38, $\mathcal{L}$ includes all nonnegative
$\mathcal{B}_1$-measurable $g$. Now fix $g$ and apply the same argument to $f$ to conclude that 3.8 is valid for
all nonnegative $g$ and $f$.

For general $g$ and $f$, note that 3.8 holding for nonnegative $g$ and $f$ implies
$E|f(X)g(Y)| = [E|f(X)|][E|g(Y)|]$, so integrability of $f(X)$ and $g(Y)$
implies that of $f(X)g(Y)$. By writing $f = f^+ - f^-$, $g = g^+ - g^-$ we
obtain the general result.

Note that if $X$ and $Y$ are independent, then so are the random variables
$f(X)$ and $g(Y)$. So actually the above proposition is no more general than the
statement: Let $X$ and $Y$ be independent random variables, $E|X| < \infty$,
$E|Y| < \infty$, then $E|XY| < \infty$ and $EXY = EX \cdot EY$.

By induction, we get 


Corollary 3.9. Let $X_1, \ldots, X_n$ be independent random variables such that
$E|X_k| < \infty$, $k = 1, \ldots, n$. Then
$$E|X_1 \cdots X_n| < \infty \quad \text{and} \quad E(X_1 \cdots X_n) = \prod_{k=1}^{n} EX_k.$$
Proof. Follows from 3.8 by induction.
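
The product rule of 3.8 and 3.9 can be checked exactly on small discrete examples. The following is a minimal sketch, not part of the original text: the two laws `law_x` and `law_y` are hypothetical, and independence is imposed by weighting each pair of values with the product of the marginal probabilities.

```python
from fractions import Fraction
from itertools import product


def expectation(law):
    """Mean of a finite distribution given as {value: probability}."""
    return sum(value * prob for value, prob in law.items())


def expectation_of_product(law_x, law_y):
    """E[XY] for independent X, Y: the joint law is the product of the marginals."""
    return sum(
        x * y * px * py
        for (x, px), (y, py) in product(law_x.items(), law_y.items())
    )


# Hypothetical example laws, chosen only for illustration.
law_x = {-1: Fraction(1, 4), 0: Fraction(1, 4), 2: Fraction(1, 2)}
law_y = {1: Fraction(1, 3), 3: Fraction(2, 3)}

lhs = expectation_of_product(law_x, law_y)        # E[XY]
rhs = expectation(law_x) * expectation(law_y)     # EX * EY
print(lhs, rhs)
```

Exact rational arithmetic is used so that the two sides can be compared with `==` rather than up to rounding.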


Problems 
1. Let $\mathcal{F}_1$, $\mathcal{F}_2$ be independent $\sigma$-fields. Show that if a set $A$ is both in $\mathcal{F}_1$
and $\mathcal{F}_2$, then $P(A) = 0$ or $1$.


2. Use Fubini's theorem (Appendix A.37) to show that for $X$ and $Y$
independent random variables

a) for any $B \in \mathcal{B}_1$, $P(X \in B - y)$ is a $\mathcal{B}_1$-measurable function of $y$,
b) $P(X + Y \in B) = \int P(X \in B - y) P(Y \in dy)$.


2. TAIL EVENTS AND THE KOLMOGOROV ZERO-ONE LAW 


Consider the set $E$ again, on which $S_n/n \nrightarrow 1/2$ for fair coin-tossing. As
pointed out, this set has the odd property that whether or not $\omega \in E$ does not
depend on the first $n$ coordinates of $\omega$ no matter how large $n$ is. Sets which
have this fascinating property we call tail events.


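Definition 3.10. Let $X_1, X_2, \ldots$ be any process. A set $E \in \mathcal{F}(\mathbf{X})$ will be called
a tail event if $E \in \mathcal{F}(X_n, X_{n+1}, \ldots)$, all $n$. Equivalently, let $\mathcal{T}$ be the $\sigma$-field
$\bigcap_{n=1}^{\infty} \mathcal{F}(X_n, X_{n+1}, \ldots)$, then $\mathcal{T}$ is called the tail $\sigma$-field and any set $E \in \mathcal{T}$ is
called a tail event.


This definition may seem formidable, but it captures formally the sense
in which certain events do not depend on any finite number of their
coordinates. For example,
$$E = \Big\{\omega;\ \frac{X_1 + \cdots + X_n}{n} \nrightarrow \frac{1}{2}\Big\}$$
is a tail event. Because for any $k \ge 1$,
$$E = \Big\{\omega;\ \frac{X_k + \cdots + X_n}{n} \nrightarrow \frac{1}{2}\Big\},$$
hence $E \in \mathcal{F}(X_k, X_{k+1}, \ldots)$ for all $k \ge 1$, $\Rightarrow E \in \mathcal{T}$.
An important class of tail events is given as follows:


Definition 3.11. Let $X_1, X_2, \ldots$ be any process, $B_1, B_2, \ldots$ Borel sets. The
set $X_n$ in $B_n$ infinitely often, denoted $\{X_n \in B_n \text{ i.o.}\}$, is the set $\{\omega;\ X_n(\omega) \in B_n$
occurs for an infinite number of $n\}$. Equivalently,
$$\{X_n \in B_n \text{ i.o.}\} = \lim_m \bigcup_{n \ge m} \{X_n \in B_n\}.$$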
It is fairly apparent that for many strong limit theorems the events involved 
will be tail. Hence it is most gratifying that the following theorem is in 
force. 


Theorem 3.12. (Kolmogorov zero-one law). Let $X_1, X_2, \ldots$ be independent
random variables. Then if $E \in \mathcal{T}$, $P(E)$ is either zero or one.


Proof. $E \in \mathcal{F}(\mathbf{X})$. By 2.33, there are sets $E_n \in \mathcal{F}(X_1, \ldots, X_n)$ such that
$P(E_n \,\Delta\, E) \to 0$. This implies $P(E_n) \to P(E)$, and $P(E_n \cap E) \to P(E)$. But
$E \in \mathcal{F}(X_{n+1}, X_{n+2}, \ldots)$, hence $E$ and $E_n$ are in independent $\sigma$-fields. Thus
$P(E_n \cap E) = P(E_n)P(E)$. Taking limits in this latter equation gives
$$P(E) = [P(E)]^2.$$
The only solutions of $x = x^2$ are $x = 0$ or $1$. Q.E.D.

This is really a heart-warming result. It puts us into secure business
with strong limit theorems for independent random variables involving
tail events. Either the theorem holds true for almost all $\omega \in \Omega$ or it fails
almost surely.


Problems 


3. Show that $\{X_n \in B_n \text{ i.o.}\}$ is a tail event.


4. In the coin-tossing game let $s$ be any sequence $m$-long of zeros or ones.
Let $Z_n$ be the vector $(X_{n+1}, \ldots, X_{n+m})$, and $F$ the set $\{Z_n = s \text{ i.o.}\}$. Show
that $F \in \mathcal{T}$.


5. (the random signs problem). Let $c_n$ be any sequence of real numbers.
In the fair coin-tossing game let $Y_n = \pm 1$ as the $n$th toss is H or T. Let
$D = \{\omega;\ \sum c_n Y_n \text{ converges}\}$; show that $D \in \mathcal{T}$.


3. THE BOREL-CANTELLI LEMMA 


Every tail event has probability zero or one. Now the important question is: 
how to decide which is which. The Borel-Cantelli lemma is a most impor- 
tant step in that direction. It applies to a class of events which includes many 
tail-events, but it also has other interesting applications. 


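Definition 3.13. In $(\Omega, \mathcal{F}, P)$, let $A_n \in \mathcal{F}$. The set $\{A_n \text{ i.o.}\}$ is defined as
$\{\omega;\ \omega \in A_n$ for an infinite number of $n\}$, or equivalently
$$\{A_n \text{ i.o.}\} = \lim_m \bigcup_{n \ge m} A_n.$$

Borel-Cantelli Lemma 3.14
I. The direct half. If $A_n \in \mathcal{F}$, then $\sum_1^\infty P(A_n) < \infty$ implies $P(A_n \text{ i.o.}) = 0$.

To state the second part of the Borel-Cantelli lemma we need


Definition 3.15. Events $A_1, A_2, \ldots$ in $(\Omega, \mathcal{F}, P)$ will be called independent
events if the random variables $\chi_{A_1}, \chi_{A_2}, \ldots$ are independent (see Problem 8).


II. The converse half. If $A_n \in \mathcal{F}$ are independent events then $\sum_1^\infty P(A_n) = \infty$
implies $P(A_n \text{ i.o.}) = 1$.


Proof of I
$$P(A_n \text{ i.o.}) = P\Big(\lim_m \bigcup_{n \ge m} A_n\Big) = \lim_m P\Big(\bigcup_{n \ge m} A_n\Big) \le \lim_m \sum_{n \ge m} P(A_n).$$
But obviously $\sum_1^\infty P(A_n) < \infty$ implies that $\sum_m^\infty P(A_n) \to 0$, as $m \to \infty$.

Proof of II
$$P\Big(\bigcup_{n \ge m} A_n\Big) = 1 - P\Big(\bigcap_{n \ge m} A_n^c\Big).$$
We show that $P(\bigcap_{n \ge m} A_n^c) = 0$. Because the events $\{A_n\}$ are independent,
$$P\Big(\bigcap_{n \ge m} A_n^c\Big) = \prod_{n \ge m} P(A_n^c) = \prod_{n \ge m} (1 - P(A_n)).$$
Use the inequality $\log(1 - x) \le -x$ to get
$$\log\Big(\prod_{n \ge m} (1 - P(A_n))\Big) = \sum_{n \ge m} \log(1 - P(A_n)) \le -\sum_{n \ge m} P(A_n) = -\infty.$$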
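
The product in the proof of II can be evaluated exactly in two contrasting cases. The sketch below is illustrative only and is not from the original text: the choices $P(A_n) = 1/n$ (divergent sum) and $P(A_n) = 1/n^2$ (convergent sum) and the helper names are assumptions. In the first case the probability that none of $A_m, \ldots, A_M$ occurs telescopes to $(m-1)/M \to 0$; in the second it stays bounded away from zero.

```python
from fractions import Fraction


def prob_none_occur(probs):
    """P(no A_n occurs) for independent events with the given probabilities."""
    result = Fraction(1)
    for p in probs:
        result *= 1 - p
    return result


def probabilities(prob_of, first, last):
    """List of P(A_n) for n = first..last, where prob_of maps n to P(A_n)."""
    return [prob_of(n) for n in range(first, last + 1)]


# Divergent sum: P(A_n) = 1/n, so the product telescopes to (m - 1)/M.
divergent = prob_none_occur(probabilities(lambda n: Fraction(1, n), 2, 1000))

# Convergent sum: P(A_n) = 1/n^2, so the product equals (M + 1)/(2M).
convergent = prob_none_occur(probabilities(lambda n: Fraction(1, n * n), 2, 1000))

print(divergent, convergent)
```

Only the divergent case forces the product to zero, which is exactly what the converse half needs.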


Application 1. In coin-tossing, let $s$ be any sequence $k$-long of H, T.
$A_n = \{\omega;\ (\omega_n, \ldots, \omega_{n+k-1}) = s\}$, $0 < P(\text{Heads}) < 1$.

Proposition 3.16. $P(A_n \text{ i.o.}) = 1$.


Proof. Let $B_1 = \{\omega;\ (\omega_1, \ldots, \omega_k) = s\}$, $B_2 = \{\omega;\ (\omega_{k+1}, \ldots, \omega_{2k}) = s\}$, $\ldots$
The difficulty is that the $A_n$ are not independent events because of the
overlap, for instance, between $A_1$ and $A_2$, but the $B_n$ are independent, and
$\{A_n \text{ i.o.}\} \supset \{B_n \text{ i.o.}\}$. Now $P(B_n) = P(B_1) > 0$, so $\sum_1^\infty P(B_n) = \infty$, implying
by 3.14(II) that $P(B_n \text{ i.o.}) = 1$.

Another way of putting this proposition is that in coin-tossing (biased 
or not), given any finite sequence of H, T’s, this sequence will occur an 
infinite number of times as the tossing continues, except on a set of sequences 
of probability zero. 
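
The block argument in the proof of 3.16 can be made quantitative: since disjoint blocks are independent, the chance that the pattern fills at least one of the first $N$ blocks is $1 - (1 - P(B_1))^N$. The following sketch computes this exactly; the function names and the sample pattern are hypothetical, introduced only for illustration.

```python
from fractions import Fraction


def pattern_probability(pattern, p_heads):
    """P(B_1): probability that one block of tosses equals the given H/T pattern."""
    prob = Fraction(1)
    for toss in pattern:
        prob *= p_heads if toss == "H" else 1 - p_heads
    return prob


def prob_seen_in_blocks(pattern, p_heads, num_blocks):
    """P(at least one of B_1, ..., B_N occurs), using independence of disjoint blocks."""
    miss_one_block = 1 - pattern_probability(pattern, p_heads)
    return 1 - miss_one_block ** num_blocks


for num_blocks in (1, 10, 100):
    print(num_blocks, float(prob_seen_in_blocks("HTH", Fraction(1, 2), num_blocks)))
```

This only shows that the pattern appears at least once with probability tending to one; the "infinitely often" statement is what the converse half of the Borel-Cantelli lemma adds.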


Application 2. Again, in coin-tossing, let $Y_i = \pm 1$, as the $i$th toss is H or T,
$Z_n = Y_1 + \cdots + Y_n$. If $Z_n = 0$, we say that an equalization (or return
to the origin) takes place at time $n$. Let $A_n = \{Z_n = 0\}$. Then $\{A_n \text{ i.o.}\} =
\{\omega;$ an infinite number of equalizations occur$\}$.


Proposition 3.17. If $P(\text{Heads}) \ne 1/2$, then $P(Z_n = 0 \text{ i.o.}) = 0$.


Proof. Immediate, from the Borel-Cantelli lemma and the asymptotic
expression for $P(Z_n = 0)$.
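
To see numerically why the direct half applies, one can sum the exact equalization probabilities $P(Z_{2n} = 0) = \binom{2n}{n} p^n q^n$ (equalization is impossible at odd times). The sketch below is not from the original text; the function name and the sample values of $p$ are assumptions. For $p \ne 1/2$ the partial sums stay bounded, while for $p = 1/2$ they keep growing, so the direct half gives no information in the fair case.

```python
def equalization_partial_sum(p_heads, num_terms):
    """Sum of P(Z_{2n} = 0) = C(2n, n) (pq)^n over n = 1..num_terms."""
    pq = p_heads * (1.0 - p_heads)
    term = 1.0   # value of C(2n, n) (pq)^n at n = 0
    total = 0.0
    for n in range(1, num_terms + 1):
        # C(2n, n) / C(2n - 2, n - 1) = 2 (2n - 1) / n
        term *= 2.0 * (2 * n - 1) / n * pq
        total += term
    return total


for p in (0.5, 0.6):
    print(p, [round(equalization_partial_sum(p, m), 4) for m in (10, 100, 2000)])
```

For the biased coin the limit of the partial sums is $1/\sqrt{1 - 4pq} - 1$, by the generating function of the central binomial coefficients.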

Another statement of 3.17 is that in biased coin-tossing, as we continue 
tossing, we eventually come to a last equalization and past this toss there are 
no more equalizations. What if the coin is fair? 


Theorem 3.18. For a fair coin, $P(Z_n = 0 \text{ i.o.}) = 1$.


Proof. The difficulty, of course, is that the events A, = {Z, = 0} are not 
independent, so 3.14 is not directly applicable. In order to get around this, 
we manufacture a most pedestrian proof, which is typical of the way in which 
the Borel-Cantelli lemma is stretched out to cover cases of nonindependent 
events. 

The idea of the proof is this: we want to apply the converse part of
the Borel-Cantelli lemma, but in order to do this we can look only at the
random variables $Y_k$ related to disjoint stretches of tosses. That is, if we
consider a subsequence $n_1 < n_2 < n_3 < \cdots$ of the integers, then any events
$\{C_k\}$ such that each $C_k$ depends only on $\{Y_{n_k+1}, Y_{n_k+2}, \ldots, Y_{n_{k+1}}\}$ are
independent events to which the Borel-Cantelli lemma applies. Suppose, for
instance, that we select $n_k < m_k < n_{k+1}$ and define
$$C_k = \{Y_{n_k+1} + \cdots + Y_{m_k} \le -n_k\} \cap \{Y_{m_k+1} + \cdots + Y_{n_{k+1}} \ge m_k\}.$$
The purpose of defining $C_k$ this way is that we know
$$Z_{n_k} = Y_1 + \cdots + Y_{n_k} \le n_k$$
because each $Y_i$ is $\pm 1$. Hence $\omega \in C_k \Rightarrow Z_{m_k} \le 0$. Again $Z_{m_k} \ge -m_k$, so,
in addition,
$$\{\omega \in C_k\} \Rightarrow Z_{n_{k+1}} \ge 0.$$
Therefore $\{\omega \in C_k\} \Rightarrow \{Z_n = 0$ at least once for $n_k + 1 \le n \le n_{k+1}\}$. We
have used here a standard trick in probability theory of considering stretches
$n_1, n_2, \ldots$ so far apart that the effect of what happened previously to $n_k$
is small as compared to the amount that $Z_n$ can change between $n_k$ and $n_{k+1}$.
Now
$$\{C_k \text{ i.o.}\} \subset \{Z_n = 0 \text{ i.o.}\}.$$
So we need only to prove now that the $n_k$, $m_k$ can be selected in such a way
that $\sum_1^\infty P(C_k) = \infty$.


Assertion: Given any number $\alpha$, $0 < \alpha < 1$, and integer $k \ge 1$, $\exists$ an integer
$\varphi(k) \ge 1$ such that
$$P(|Z_{\varphi(k)}| < k) \le \alpha.$$
Proof. We know that for any fixed $j$,
$$P(Z_n = j) \to 0.$$
Hence for $k$ fixed, as $n \to \infty$,
$$\sum_{|j| < k} P(Z_n = j) \to 0.$$
Simply take $\varphi(k)$ sufficiently large so that
$$\sum_{|j| < k} P(Z_{\varphi(k)} = j) < \alpha.$$
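
The Assertion only claims that some such integer exists, but for a fair coin it can be computed directly from binomial probabilities. The sketch below, which is an illustration and not the author's construction, returns the smallest $n \ge 1$ with $P(|Z_n| < k) \le \alpha$; the function names are hypothetical, and any larger admissible $n$ would serve equally well in the proof.

```python
from fractions import Fraction
from math import comb


def prob_abs_less_than(n, k):
    """P(|Z_n| < k) for a fair coin, where Z_n = 2 * (number of heads) - n."""
    favorable = sum(comb(n, heads) for heads in range(n + 1) if abs(2 * heads - n) < k)
    return Fraction(favorable, 2 ** n)


def phi(k, alpha):
    """Smallest n >= 1 with P(|Z_n| < k) <= alpha (one admissible choice of phi(k))."""
    n = 1
    while prob_abs_less_than(n, k) > alpha:
        n += 1
    return n


print([phi(k, Fraction(1, 2)) for k in range(1, 6)])
```

The loop terminates because $P(|Z_n| < k) \to 0$ as $n \to \infty$ for each fixed $k$, which is the content of the proof above.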


Define $n_k$, $m_k$ as follows: $n_1 = 1$, $m_k = n_k + \varphi(n_k)$, $n_{k+1} = m_k + \varphi(m_k)$.
Compute $P(C_k)$ as follows:
$$P(C_k) = P(Y_{n_k+1} + \cdots + Y_{m_k} \le -n_k)\, P(Y_{m_k+1} + \cdots + Y_{n_{k+1}} \ge m_k).$$
By symmetry,
$$P(C_k) = \tfrac{1}{4} P(|Y_{n_k+1} + \cdots + Y_{m_k}| \ge n_k)\, P(|Y_{m_k+1} + \cdots + Y_{n_{k+1}}| \ge m_k).$$
Thus, since the distribution of the vector $(Y_{i+1}, \ldots, Y_{i+j})$ is the same as that
of $(Y_1, \ldots, Y_j)$,
$$P(C_k) = \tfrac{1}{4} P(|Y_1 + \cdots + Y_{\varphi(n_k)}| \ge n_k)\, P(|Y_1 + \cdots + Y_{\varphi(m_k)}| \ge m_k) \ge \tfrac{1}{4}(1 - \alpha)^2.$$
Q.E.D.
This proof is a bit of a mess. Now let me suggest a much more exciting
possibility. Suppose we can prove that $P(Z_n = 0 \text{ at least once}) = 1$. Now
every time there is an equalization, everything starts all over again. That
is, if $Z_n = 0$, then the game starts from the $(n + 1)$st toss as though it were
beginning at $n = 0$. Consequently, we are sure now to have at least one
more equalization. Continue this argument now ad infinitum to conclude that
$P(Z_n = 0 \text{ at least once}) = 1 \Rightarrow P(Z_n = 0 \text{ i.o.}) = 1$. We make this argument
hold water when 3.18 is generalized in Section 7, and generalize it again
in Chapter 7.


Problems 

6. Show, by using 3.14, that $X_n \xrightarrow{P} X$ $\Rightarrow$ $\exists$ a subsequence $X_{n_k}$ such that
$X_{n_k} \xrightarrow{a.s.} X$.

7. Show, using 3.14, that if $X_n - X_m \xrightarrow{P} 0$, $\exists$ a random variable $X$ such that
$X_n \xrightarrow{P} X$. [Hint: Take $\epsilon_k \downarrow 0$ and $n_k$ such that for $m, n \ge n_k$,
$$P(|X_n - X_m| > \epsilon_k) \le \frac{1}{2^k}.$$
Now prove that there is a random variable $X$ such that $X_{n_k} \xrightarrow{a.s.} X$.]

8. In order that events $A_1, A_2, \ldots$ be independent, show it is sufficient that
$$P(A_{i_1} \cap \cdots \cap A_{i_m}) = P(A_{i_1}) \cdots P(A_{i_m})$$
for every finite subcollection $A_{i_1}, \ldots, A_{i_m}$. [One interesting approach to the
required proof is: Let $\mathcal{D}$ be the smallest field containing $A_1, \ldots, A_N$.
Define $Q$ on $\mathcal{D}$ by $Q(B_1 \cap \cdots \cap B_N) = P(B_1) \cdots P(B_N)$, where the sets
$B_k$ are equal to $A_k$ or $A_k^c$. Use $P(A_{i_1} \cap \cdots \cap A_{i_m}) = P(A_{i_1}) \cdots P(A_{i_m})$ to
show that $P = Q$ on a class of sets to which 2.23 can be applied. Conclude
that $P = Q$ on $\mathcal{D}$.]


9. Use the strong law of large numbers in the form $S_n/n \to p$ a.s. to
prove 3.17.


10. Let $X_1, X_2, \ldots$ be independent identically distributed random variables.
Prove that $E|X_1| < \infty$ if and only if
$$P(|X_n| > n \text{ i.o.}) = 0.$$
(See Loève [108, p. 239].)


4. THE RANDOM SIGNS PROBLEM 


In Problem 5, it is shown for $Y_1, Y_2, \ldots$ independent $+1$ or $-1$ with
probability $1/2$, that the set $\{\omega;\ \sum_1^\infty c_k Y_k \text{ converges}\}$ is a tail event. Therefore
it has probability zero or one. The question now is to characterize the
sequences $\{c_n\}$ such that $\sum_1^\infty c_k Y_k$ converges a.s.

This question is naturally arrived at when you look at the sequence
$1/n$, that is, $\sum 1/n$ diverges, but $\sum (-1)^n 1/n$ converges. Now what happens if
the signs are chosen at random? In general, look at the consecutive sums
$\sum_1^n X_k$ of any sequence $X_1, X_2, \ldots$ of independent random variables. The
convergence set is again a tail event. When does it have probability one?

The basic result here is that in this situation convergence in probability
implies the much stronger convergence almost surely.


Theorem 3.19. For $X_1, X_2, \ldots$ independent random variables,
(3.20) $\sum_1^n X_k$ converges in probability $\Rightarrow$ $\sum_1^n X_k$ converges a.s.


Proof. Proceeds by an important lemma which is due to Skorokhod [125].


Lemma 3.21. Let $S_1, \ldots, S_N$ be successive sums of independent random
variables such that $\sup_{j \le N} P(|S_N - S_j| > \alpha) = c < 1$. Then
$$P\Big(\sup_{j \le N} |S_j| > 2\alpha\Big) \le \frac{1}{1 - c} P(|S_N| > \alpha).$$
Proof. Let $j^*(\omega) = \{$first $j$ such that $|S_j| > 2\alpha\}$. Then
$$P\Big(|S_N| > \alpha,\ \sup_{j \le N} |S_j| > 2\alpha\Big) = \sum_{j=1}^{N} P(|S_N| > \alpha,\ j^* = j)
\ge \sum_{j=1}^{N} P(|S_N - S_j| \le \alpha,\ j^* = j).$$
The set $\{j^* = j\}$ is in $\mathcal{F}(X_1, \ldots, X_j)$, and $S_N - S_j$ is measurable
$\mathcal{F}(X_{j+1}, \ldots, X_N)$, so the last sum on the right above equals
$$\sum_{j=1}^{N} P(|S_N - S_j| \le \alpha) P(j^* = j).$$
Use $P(|S_N - S_j| \le \alpha) \ge 1 - c$ to get
$$P\Big(|S_N| > \alpha,\ \sup_{j \le N} |S_j| > 2\alpha\Big) \ge (1 - c) \sum_{j=1}^{N} P(j^* = j)
= (1 - c) P\Big(\sup_{j \le N} |S_j| > 2\alpha\Big).$$
The observation that
$$P\Big(|S_N| > \alpha,\ \sup_{j \le N} |S_j| > 2\alpha\Big) \le P(|S_N| > \alpha)$$
completes the proof.
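
Lemma 3.21 can be verified exactly in a small case by enumerating all paths. The following sketch is an illustration under stated assumptions, not part of the original proof: the summands are taken to be fair $\pm 1$ steps, $N = 10$ and $\alpha = 2$ are arbitrary choices, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import accumulate, product


def skorokhod_check(num_steps, alpha):
    """Return (P(sup_j |S_j| > 2a), P(|S_N| > a), c) for a fair +-1 walk, by enumeration."""
    paths = list(product((-1, 1), repeat=num_steps))
    weight = Fraction(1, len(paths))
    sup_exceeds = Fraction(0)                    # P(sup_{j<=N} |S_j| > 2 alpha)
    end_exceeds = Fraction(0)                    # P(|S_N| > alpha)
    gap_exceeds = [Fraction(0)] * num_steps      # entry j-1 holds P(|S_N - S_j| > alpha)
    for steps in paths:
        sums = list(accumulate(steps))           # S_1, ..., S_N
        if max(abs(s) for s in sums) > 2 * alpha:
            sup_exceeds += weight
        if abs(sums[-1]) > alpha:
            end_exceeds += weight
        for j, partial in enumerate(sums):
            if abs(sums[-1] - partial) > alpha:
                gap_exceeds[j] += weight
    return sup_exceeds, end_exceeds, max(gap_exceeds)


lhs, tail, c = skorokhod_check(num_steps=10, alpha=2)
bound = tail / (1 - c)
print(float(lhs), float(bound), float(c))
```

The point of the lemma is visible here: the maximum of the whole path is controlled by the endpoint alone, at the cost of the factor $1/(1 - c)$ and a halving of the level.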


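To finish the proof of 3.19: If a sequence $s_n$ of real numbers does not
converge, then there exists an $\epsilon > 0$ such that for every $m$,
$$\sup_{n > m} |s_n - s_m| > \epsilon.$$
So if $\sum_1^\infty X_k$ diverges with positive probability then there exists an $\epsilon > 0$
and $\delta > 0$ such that for every $m$ fixed,
$$P\Big(\sup_{n > m} \Big|\sum_{m+1}^{n} X_k\Big| > \epsilon\Big) \ge \delta.$$
By (3.21), keeping $m$, $N$ fixed,
$$P\Big(\sup_{m < n \le N} \Big|\sum_{m+1}^{n} X_k\Big| > \epsilon\Big) \le \frac{1}{1 - c_{mN}} P\Big(\Big|\sum_{m+1}^{N} X_k\Big| > \frac{\epsilon}{2}\Big),$$
where
$$c_{mN} = \sup_{m < n \le N} P\Big(\Big|\sum_{n+1}^{N} X_k\Big| > \frac{\epsilon}{2}\Big).$$
If $\sum_1^n X_k$ is convergent in probability, then $\sum_1^N X_k - \sum_1^m X_k \xrightarrow{P} 0$ as $m, N \to \infty$.
Hence, as $m, N \to \infty$, $c_{mN} \to 0$ and $P(|\sum_{m+1}^{N} X_k| > \epsilon/2) \to 0$.
Taking first $N \to \infty$, conclude
$$\lim_m P\Big(\sup_{n > m} \Big|\sum_{m+1}^{n} X_k\Big| > \epsilon\Big) = 0.$$
This contradiction proves the theorem.


We can use convergence in second mean to get an immediate criterion.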


Corollary 3.22. If $EX_k = 0$, all $k$, and $\sum_1^\infty EX_k^2 < \infty$, then the sums $\sum_1^n X_k$
converge a.s.


In particular, for the random signs problem, mentioned at the beginning 
of this section, the following corollary holds. 
Corollary 3.23. A sufficient condition for the sums $\sum_1^n c_k Y_k$ to converge a.s. is
$\sum_1^\infty c_k^2 < \infty$.
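
The second-mean criterion behind 3.22 and 3.23 rests on the identity $E(\sum_{m+1}^{N} c_k Y_k)^2 = \sum_{m+1}^{N} c_k^2$ for independent fair signs, which makes the partial sums Cauchy in second mean when $\sum c_k^2 < \infty$. The sketch below checks the identity by enumerating all sign patterns; the coefficients $c_k = 1/k$ and the block $k = 5, \ldots, 14$ are assumed for illustration and do not come from the text.

```python
from fractions import Fraction
from itertools import product


def second_moment_of_block(coeffs):
    """E (sum_k c_k Y_k)^2 for independent fair signs Y_k, by enumerating all sign patterns."""
    weight = Fraction(1, 2 ** len(coeffs))
    return sum(
        weight * sum(c * y for c, y in zip(coeffs, signs)) ** 2
        for signs in product((-1, 1), repeat=len(coeffs))
    )


coeffs = [Fraction(1, k) for k in range(5, 15)]   # c_k = 1/k for k = 5, ..., 14
enumerated = second_moment_of_block(coeffs)
formula = sum(c * c for c in coeffs)              # sum of c_k^2 over the same block
print(float(enumerated), float(formula))
```

The cross terms $c_j c_k E(Y_j Y_k)$ vanish by independence and $EY_k = 0$, which is why only the squares survive.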
The open question is necessity. Marvelously enough, the converse of 3.23
is true, so that $\sum c_k Y_k$ converges if and only if $\sum c_k^2 < \infty$. In fact, a partial
converse of 3.22 holds.


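Theorem 3.24. Let $X_1, X_2, \ldots$ be independent random variables such that
$EX_k = 0$, and $|X_k| \le \alpha < \infty$, all $k$. Then $\sum_1^n X_k$ converges a.s. implies
$$\sum_1^\infty EX_k^2 < \infty.$$

Proof. For any $\lambda > 0$, define $n^*(\omega)$ by
$$n^*(\omega) = \begin{cases} \text{1st } n \text{ such that } \big|\sum_1^n X_k\big| > \lambda, \\ +\infty \text{ if no such } n \text{ exists,} \end{cases}$$
where $n^*$ is an extended random variable. For any integers $j \le N$, look at
$$\int_{\{n^* = j\}} \Big(\sum_1^N X_k\Big)^2 dP.$$
Since $\{n^* = j\} \in \mathcal{F}(X_1, \ldots, X_j)$, then by independence, and $EX_k = 0$, all $k$,
$$\int_{\{n^* = j\}} \Big(\sum_1^N X_k\Big)^2 dP = \int_{\{n^* = j\}} \Big(\sum_1^j X_k\Big)^2 dP + \sum_{k=j+1}^{N} \int_{\{n^* = j\}} X_k^2 \, dP.$$
On $\{n^* = j\}$, $|\sum_1^{j-1} X_k| \le \lambda$. Hence since $|X_j| \le \alpha$, $|\sum_1^j X_k| \le \lambda + \alpha$.
And, by independence, for $k > j$,
$$\int_{\{n^* = j\}} X_k^2 \, dP = P(n^* = j) EX_k^2.$$
Using these,
$$\int_{\{n^* = j\}} \Big(\sum_1^N X_k\Big)^2 dP \le (\lambda + \alpha)^2 P(n^* = j) + P(n^* = j) \sum_1^N EX_k^2.$$
Sum on $j$ from 1 up to $N$ to get
$$\int_{\{n^* \le N\}} \Big(\sum_1^N X_k\Big)^2 dP \le (\lambda + \alpha)^2 P(n^* \le N) + P(n^* \le N) \sum_1^N EX_k^2.$$
Also,
$$\int_{\{n^* > N\}} \Big(\sum_1^N X_k\Big)^2 dP \le \lambda^2 P(n^* > N) \le (\lambda + \alpha)^2 P(n^* > N).$$
Adding this to the above inequality we get
$$\sum_1^N EX_k^2 \le (\lambda + \alpha)^2 + P(n^* \le N) \sum_1^N EX_k^2$$
or
$$P(n^* > N) \sum_1^N EX_k^2 \le (\lambda + \alpha)^2.$$
Letting $N \to \infty$, we find that
$$P(n^* = \infty) \sum_1^\infty EX_k^2 \le (\lambda + \alpha)^2.$$
But, since $\sum_1^n X_k$ converges a.s., then there must exist a $\lambda$ such that
$$P\Big(\sup_n \Big|\sum_1^n X_k\Big| \le \lambda\Big) > 0,$$
implying $P(n^* = \infty) > 0$ and $\sum_1^\infty EX_k^2 < \infty$.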


The results of 3.24 can be considerably sharpened. But why bother;
elegant necessary and sufficient conditions exist for the convergence of sums
$\sum_1^n X_k$ where the only assumption made is that the $X_k$ are independent.
This is the "three-series" theorem of Kolmogorov (see Loève [108, p.
237]). More on this will appear in Chapter 9. Kac [82] has interesting
analytic proofs of 3.23 and its converse.


Problems 
11. Let $X_1, X_2, \ldots$ be independent, and $X_k \ge 0$. If for some $\delta$, $0 < \delta < 1$,
there exists an $x$ such that
$$\int_{\{X_k > x\}} X_k \, dP \le \delta EX_k$$
for all $k$, show that
$$\sum_1^\infty X_k < \infty \text{ a.s.} \Rightarrow \sum_1^\infty EX_k < \infty.$$
Give an example to show that in general, $X_1, X_2, \ldots$ independent nonnegative
random variables and $\sum_1^\infty X_k < \infty$ a.s. does not imply that
$$\sum_1^\infty EX_k < \infty.$$

12. Let $Y_1, Y_2, \ldots$ be a process. We will say that the integer-valued random
variables $m_1, m_2, \ldots$ are optional skipping variables if

i) $1 \le m_1 < m_2 < \cdots$,

ii) $\{m_k = j\} \in \mathcal{F}(Y_1, \ldots, Y_j)$

(i.e., the decision as to which game to play next depends only on the previous
outcomes). Denote $\tilde{Y}_k = Y_{m_k}$. Show that


a) If the $Y_1, Y_2, \ldots$ are independent and identically distributed then the
sequence $\tilde{Y}_1, \tilde{Y}_2, \ldots$ has the same distribution as $Y_1, Y_2, \ldots$

b) For Yı, Ya... as in (a), show that the sequence Ym, Ymr Ymo- 
has the same distribution as Y4, Yo,... ce 

c) Give an example where the $Y_1, Y_2, \ldots$ are independent, but the $\tilde{Y}_1, \tilde{Y}_2, \ldots$
are not independent.


5. THE LAW OF PURE TYPES 


Suppose that $X_1, X_2, \ldots$ are independent and $\sum_1^n X_k$ converges a.s. What
can be said about the distribution of the limit $X = \sum_1^\infty X_k$? In general, very
little! In the nature of things the distribution of $X$ can be anything. For
example, let $X_k = 0$, $k > 1$, then $X = X_1$. There is one result available
here which is an application of the Kolmogorov zero-one law and remarkable
for its simplicity and elegance.


Definition 3.25. A random variable X is said to have a distribution of pure type 
if either 


i) There is a countable set $D$ such that $P(X \in D) = 1$,

ii) $P(X = x) = 0$ for every $x \in R^{(1)}$, but there is a set $D \in \mathcal{B}_1$ of Lebesgue
measure zero such that $P(X \in D) = 1$, or

iii) $P(X \in dx) \ll l(dx)$ (Lebesgue measure).


[Recall that $\mu \ll \nu$ for two measures $\mu$, $\nu$ denotes $\mu$ absolutely continuous
with respect to $\nu$; see Appendix A.29].


Theorem 3.26 (Jessen-Wintner law of pure types [78]). Let $X_1, X_2, \ldots$ be
independent random variables such that

i) $\sum_1^n X_k \xrightarrow{a.s.} X$,

ii) For each $k$, there is a countable set $F_k$ such that $P(X_k \in F_k) = 1$.

Then the distribution of $X$ is of pure type.

Proof. Let $F = \bigcup_{k \ge 1} F_k$. Take $G$ to be the smallest additive group in $R^{(1)}$
containing $F$. $G$ consists of all numbers of the form $m_1 x_1 + \cdots + m_j x_j$,
$x_1, \ldots, x_j \in F$ and $m_1, \ldots, m_j$ integers. $F$ is countable, hence $G$ is countable.
For any set $B \subset R^{(1)}$ write
$$G \oplus B = \{x \in R^{(1)};\ x = x_1 + x_2,\ x_1 \in G,\ x_2 \in B\}.$$
Note that

i) $B$ countable $\Rightarrow$ $G \oplus B$ countable,
ii) $B \in \mathcal{B}_1$ $\Rightarrow$ $G \oplus B \in \mathcal{B}_1$,
iii) $B \in \mathcal{B}_1$, $l(B) = 0$ $\Rightarrow$ $l(G \oplus B) = 0$.

For $B \in \mathcal{B}_1$, and $C = \{\omega;\ \sum_1^n X_k \text{ converges}\}$, consider the event
$$A = \{X \in G \oplus B\} \cap C.$$
The point is that $A$ is a tail event. Because if $x_1 - x_2 \in G$, then
$$x_1 \in G \oplus B \Leftrightarrow x_2 \in G \oplus B.$$
But $X - \sum_n^\infty X_k \in G$ for all $\omega$ in $C$. Hence
$$A = \Big\{\sum_n^\infty X_k \in G \oplus B\Big\} \cap C, \quad n = 1, 2, \ldots$$
By the zero-one law $P(A) = 0$ or $1$. This gives the alternatives:

a) Either there is a countable set $D$ such that $P(X \in D) = 1$ or
$P(X \in G \oplus B) = 0$, hence $P(X \in B) = 0$, for all countable sets $B$.

b) If the latter in (a) holds, then either there is a set $D \in \mathcal{B}_1$ such that
$l(D) = 0$ and $P(X \in D) = 1$ or $P(X \in G \oplus B) = 0$, for all $B \in \mathcal{B}_1$ such
that $l(B) = 0$.

c) In this latter case $B \in \mathcal{B}_1$, $l(B) = 0$ $\Rightarrow$ $P(X \in B) = 0$, that is, the
distribution is absolutely continuous with respect to $l(dx)$.


Theorem 3.26 gives no help as to which type the distribution of the limit
random variable belongs to. In particular, for $Y_1, Y_2, \ldots$ independent $\pm 1$
with probability $1/2$, the question of the type of the distribution of the sums
$\sum_1^\infty c_n Y_n$ is open. Some important special cases are given in the following
problems.


Problems 
13. Show that $P(\sum_1^\infty Y_k/2^k \in dx)$ is Lebesgue measure on $[0, 1]$. [Recall
the analytic model for coin-tossing.]


14. If $X$ and $Y$ are independent random variables, use Problem 2 to show
that $P(X \in dx) \ll l(dx) \Rightarrow P(X + Y \in dx) \ll l(dx)$.

15. Use Problems 13 and 14 to show that the distribution of
$\sum_1^\infty Y_k/k$ is $\ll l(dx)$.


16. Show that if independent random variables $X_1, X_2, \ldots$ take values in a
countable set $F$, and if there are constants $\alpha_n \in F$ such that
$$\sum_1^\infty P(X_n \ne \alpha_n) < \infty \quad \text{and} \quad \sum_1^\infty \alpha_n < \infty,$$
then the sum $X = \sum_1^\infty X_n$ has distribution concentrated on a countable number
of points. (The converse is also true; see Lévy [101].)


6. THE LAW OF LARGE NUMBERS FOR INDEPENDENT 
RANDOM VARIABLES 


From the random signs problem, by some juggling, we can generalize the 
law of large numbers for independent random variables. 


Theorem 3.27. Let $X_1, X_2, \ldots$ be independent random variables, $EX_k = 0$,
$EX_k^2 < \infty$. Let $b_n > 0$ converge up to $+\infty$. If $\sum_1^\infty EX_k^2/b_k^2 < \infty$, then
$$\frac{X_1 + \cdots + X_n}{b_n} \xrightarrow{a.s.} 0.$$


Proof. To prove this we need: 


Kronecker's Lemma 3.28. Let $x_1, x_2, \ldots$ be a sequence of real numbers such
that $\sum_1^n x_k \to s$ finite. Take $b_n \uparrow \infty$, then
$$\frac{1}{b_n} \sum_1^n b_k x_k \to 0.$$

Proof. Let $r_n = \sum_{n+1}^\infty x_k$, $r_0 = s$; then $x_n = r_{n-1} - r_n$, $n = 1, 2, \ldots$, and
$$\sum_1^n b_k x_k = \sum_1^n b_k (r_{k-1} - r_k) = \sum_0^{n-1} b_{k+1} r_k - \sum_1^n b_k r_k
= \sum_1^{n-1} (b_{k+1} - b_k) r_k + b_1 s - b_n r_n.$$
Thus
(3.29) $$\Big|\sum_1^n b_k x_k\Big| \le \sum_1^{n-1} (b_{k+1} - b_k) |r_k| + b_1 |s| + b_n |r_n|.$$
For any $\epsilon > 0$, take $N$ such that $|r_k| \le \epsilon$ for $k \ge N$. Then letting $\bar{r} =
\max_k |r_k|$ we get
$$\sum_1^{n-1} (b_{k+1} - b_k) |r_k| \le \sum_1^{N-1} (b_{k+1} - b_k) |r_k| + \epsilon \sum_{k=N}^{n-1} (b_{k+1} - b_k)
\le \bar{r}(b_N - b_1) + \epsilon (b_n - b_N).$$
Divide (3.29) by $b_n$, and take lim sup, noting that $b_N/b_n \to 0$, $|r_n| \to 0$, to get
$$\limsup_n \Big|\frac{1}{b_n} \sum_1^n b_k x_k\Big| \le \epsilon.$$
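
A quick numerical illustration of Kronecker's lemma may help. The sketch below assumes the particular sequences $x_k = 1/k^2$ (whose partial sums converge) and $b_k = k$; these choices and the function name are illustrative and not from the text. In this case the weighted average is $H_n/n$, where $H_n$ is the $n$th harmonic number, which tends to zero even though $\sum b_k x_k$ itself diverges.

```python
def kronecker_average(x, b, n):
    """(1/b_n) * sum_{k=1}^{n} b_k x_k, with the sequences given as functions of k."""
    return sum(b(k) * x(k) for k in range(1, n + 1)) / b(n)


def x(k):
    """Assumed example sequence x_k = 1/k^2, whose partial sums converge."""
    return 1.0 / k ** 2


def b(k):
    """Assumed example weights b_k = k, increasing to infinity."""
    return float(k)


for n in (10, 1000, 10000):
    print(n, kronecker_average(x, b, n))
```

The lemma is what turns almost sure convergence of the series $\sum X_k/b_k$ into almost sure convergence of the normalized sums $(X_1 + \cdots + X_n)/b_n$ to zero.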


To prove 3.27, by Kronecker's lemma, if $\sum_1^n (X_k/b_k)$ converges a.s., then
$$\frac{1}{b_n} \sum_1^n X_k \to 0 \quad \text{a.s.}$$
By 3.22, it is sufficient that
$$\sum_1^\infty E(X_k/b_k)^2 = \sum_1^\infty EX_k^2/b_k^2 < \infty.$$
Q.E.D.


As a consequence of 3.27, if the $X_1, X_2, \ldots$ are identically distributed,
$EX_1 = 0$, and $EX_1^2 = \sigma^2 < \infty$, then
$$\sum_1^\infty \frac{1}{b_n^2} < \infty \Rightarrow \frac{X_1 + \cdots + X_n}{b_n} \to 0 \quad \text{a.s.}$$
This is stronger than 1.21, which gives the same conclusion for $b_n = n$.
For example, we could take $b_n = n^{1/2 + \epsilon}$, any $\epsilon > 0$; or $b_n = \sqrt{n} \log n$.
But the strong law of large numbers is basically a first-moment theorem. 


Theorem 3.30. Let $X_1, X_2, \ldots$ be independent and identically distributed
random variables; if $E|X_1| < \infty$ then
$$\frac{X_1 + \cdots + X_n}{n} \xrightarrow{a.s.} EX_1;$$
if $E|X_1| = \infty$, then the above averages diverge almost everywhere.


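Proof. In order to apply 3.27 define truncated random variables $\bar{X}_n$ by
$$\bar{X}_n = \begin{cases} X_n & \text{if } |X_n| \le n, \\ 0 & \text{if } |X_n| > n. \end{cases}$$
By Problem 10, $P(|X_n| > n \text{ i.o.}) = 0$. Hence (3.30) is equivalent to
$$\frac{\bar{X}_1 + \cdots + \bar{X}_n}{n} \xrightarrow{a.s.} EX_1.$$
But $E\bar{X}_n \to EX_1$, so (3.30) will follow if 3.27 can be applied to show that
$$\frac{1}{n} \sum_1^n (\bar{X}_k - E\bar{X}_k) \xrightarrow{a.s.} 0.$$
Since $E(\bar{X}_k - E\bar{X}_k)^2 \le E\bar{X}_k^2$, it is sufficient to show that
$$\sum_1^\infty \frac{E\bar{X}_n^2}{n^2} = \sum_1^\infty \frac{1}{n^2} \int_{|x| \le n} x^2 F(dx) < \infty.$$
This follows from writing the right-hand side as
$$\sum_{n=1}^\infty \frac{1}{n^2} \sum_{k=1}^{n} \int_{k-1 < |x| \le k} x^2 F(dx).$$
Interchange order of summation, and use $\sum_{n=k}^\infty 1/n^2 < 2/k$, $k \ge 1$, to get
$$\sum_1^\infty \frac{E\bar{X}_n^2}{n^2} \le 2 \sum_{k=1}^\infty \frac{1}{k} \int_{k-1 < |x| \le k} x^2 F(dx) \le 2E|X_1|.$$
For the converse, suppose that $S_n/n$ converges on a set of positive probability.
Then it converges a.s. The contradiction is that
$$\frac{X_n}{n} = \frac{S_n}{n} - \Big(\frac{n-1}{n}\Big) \frac{S_{n-1}}{n-1}$$
must converge a.s. to zero, implying $P(|X_n| > n \text{ i.o.}) = 0$. This is impossible
by Problem 10.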
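
The elementary bound $\sum_{n \ge k} 1/n^2 < 2/k$ used in the proof can be checked numerically. The sketch below is an illustration rather than a proof: the infinite tail is replaced by a finite sum up to an assumed cutoff plus the integral bound $\sum_{n > M} 1/n^2 < 1/M$, so the returned value is a genuine upper bound for the tail sum. The function name and cutoff are hypothetical.

```python
def tail_sum_upper_bound(k, cutoff=100_000):
    """Upper bound for sum_{n >= k} 1/n^2: finite part up to cutoff, plus 1/cutoff."""
    finite_part = sum(1.0 / (n * n) for n in range(k, cutoff + 1))
    # sum_{n > cutoff} 1/n^2 < integral from cutoff to infinity of dx/x^2 = 1/cutoff
    return finite_part + 1.0 / cutoff


for k in (1, 2, 5, 20):
    print(k, tail_sum_upper_bound(k), 2.0 / k)
```

For $k = 1$ the tail sum is $\pi^2/6 \approx 1.645 < 2$; for larger $k$ it behaves like $1/k$, so the constant 2 leaves room to spare.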


7. RECURRENCE OF SUMS 


Through this section let $X_1, X_2, \ldots$ be a sequence of independent, identically
distributed random variables. Form the successive sums $S_1, S_2, \ldots$


Definition 3.31. For $x \in R^{(1)}$, call $x$ a recurrent state if for every neighborhood
$I$ of $x$, $P(S_n \in I \text{ i.o.}) = 1$.


The problem is to characterize the set of recurrent states. In coin-tossing, with
$X_1, X_2, \ldots$ equaling $\pm 1$ with probability $p$, $q$, if $p \ne 1/2$, then $P(S_n = 0 \text{ i.o.}) = 0$.
In fact, the strong law of large numbers implies that for any state $j$,
$P(S_n = j \text{ i.o.}) = 0$; no states are recurrent. For fair coin-tossing, every
time $S_n$ returns to zero, the probability of entering the state $j$ is the same as
it was at the start when $n = 0$. It is natural to surmise that in this case
$P(S_n = j \text{ i.o.}) = 1$ for all $j$. But we can use this kind of reasoning for any
distribution, that is, if there is any recurrent state, then all states should be
recurrent.

Definition 3.32. Say that a random variable $X$ is distributed on the lattice
$L_d = \{nd\}$, $n = 0, \pm 1, \ldots$, $d$ any real number $> 0$, if $\sum_n P(X = nd) = 1$ and
there is no smaller lattice having this property. If $X$ is not distributed on any
lattice, it is called nonlattice. In this case, say that it is distributed on $L_0$,
where $L_0 = R^{(1)}$.
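
For a distribution with finitely many rational support points, the lattice of Definition 3.32 can be found with a greatest common divisor. The sketch below reads "no smaller lattice" as "no proper sublattice $L_{d'} \subset L_d$ also carries the distribution", so it returns the largest admissible $d$; that reading, the restriction to a finite rational support, and the function name are all assumptions made for illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def lattice_span(support):
    """Largest d > 0 such that every support point is an integer multiple of d.

    Support points must be integers or Fractions, with at least one nonzero point.
    """
    points = [Fraction(x) for x in support if x != 0]
    if not points:
        raise ValueError("a point mass at 0 lies on every lattice")
    # Put all points over a common denominator, then take the gcd of the numerators.
    common_den = reduce(lambda a, b: a * b // gcd(a, b), (p.denominator for p in points))
    numerators = [abs(int(p * common_den)) for p in points]
    return Fraction(reduce(gcd, numerators), common_den)


print(lattice_span([-1, 1]))                           # coin-tossing steps
print(lattice_span([-2, 4, 6]))
print(lattice_span([Fraction(1, 2), Fraction(3, 4)]))
```

A distribution whose support contains two points with irrational ratio lies on no lattice and would be nonlattice in the sense of the definition; that case is outside the scope of this sketch.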


Theorem 3.33. If $X_1, X_2, \ldots$ are distributed on $L_d$, $d \ge 0$, then either every
state in $L_d$ is recurrent, or no states are recurrent.


Proof. Let $G$ be the set of recurrent points. Then $G$ is closed. Because
$x_n \in G$, $x_n \to x$ implies that for every neighborhood $I$ of $x$, $x_n \in I$ for $n$
sufficiently large. Hence $P(S_n \in I \text{ i.o.}) = 1$.


Define $y$ to be a possible state if for every neighborhood $I$ of $y$, $\exists$ $k$
such that $P(S_k \in I) > 0$. I assert that
$$x \text{ recurrent},\ y \text{ possible} \Rightarrow x - y \text{ recurrent}.$$
To show this, take any $\epsilon > 0$, and $k$ such that $P(|S_k - y| < \epsilon) > 0$. Then
$$P(|S_n - x| < \epsilon \text{ finitely often})
\ge P(|S_k - y| < \epsilon,\ |S_{k+n} - S_k - (x - y)| < 2\epsilon \text{ finitely often})
= P(|S_k - y| < \epsilon)\, P(|S_n - (x - y)| < 2\epsilon \text{ finitely often}).$$
The left side is zero, implying
$$P(|S_n - (x - y)| < 2\epsilon \text{ finitely often}) = 0.$$
If $G$ is not empty, it contains at least one state $x$. Since every recurrent state
is a possible state, $x - x = 0 \in G$. Hence $G$ is a group, and therefore is a
closed subgroup of $R^{(1)}$. But the only closed subgroups of $R^{(1)}$ are the
lattices $L_d$, $d \ge 0$. For every possible state $y$, $0 - y \in G \Rightarrow y \in G$. For
$d > 0$, this implies $L_d \subset G$, hence $L_d = G$. If $X_1$ is non-lattice, $G$ cannot
be a lattice, so $G = R^{(1)}$.


A criterion for which alternative holds is established in the following 
theorem. 


Theorem 3.34. Let X,, X}, ... be distributed on La, d > 0. If there is a finite 
interval J, J O Lı Æ Ø such that > PS, EJ). < œ, then no states are 


recurrent. If there is a finite interval J such that Š P(S, = J) = œ, then all 
states in Ly are recurrent. 


Proof. If $\sum_1^\infty P(S_n \in J) < \infty$, use the Borel-Cantelli lemma to get
$P(S_n \in J \text{ i.o.}) = 0$. There is at least one state in $L_d$ that is not recurrent,
hence none are. To go the other way, we come up against the same difficulty
as in proving 3.18: the successive sums $S_1, S_2, \ldots$ are not independent.
Now we make essential use of the idea that every time one of the sums
$S_1, S_2, \ldots$ comes back to the neighborhood of a state $x$, the subsequent
process behaves nearly as if we had started off at the state $x$ at $n = 0$. If
$\sum_1^\infty P(S_n \in J) = \infty$, for any $\epsilon > 0$ and less than half the length of $J$, there is a
subinterval $I = (x - \epsilon, x + \epsilon) \subset J$ such that $\sum_1^\infty P(S_n \in I) = \infty$. Define sets

$$A_k = \{S_k \in I,\ S_{n+k} \notin I,\ n = 1, 2, \ldots\}, \quad k \ge 1,$$
$$A_0 = \{S_n \notin I,\ n = 1, 2, \ldots\}, \quad k = 0.$$
$A_k$ is the set on which the last visit to $I$ occurred at the $k$th trial. Then

$$\{S_n \in I \text{ finitely often}\} = \bigcup_{k=0}^\infty A_k.$$

The $A_k$ are disjoint, hence

$$P(S_n \in I \text{ finitely often}) = \sum_{k=0}^\infty P(A_k).$$

For $k \ge 1$,

$$P(A_k) \ge P(S_k \in I,\ |S_{n+k} - S_k| \ge 2\epsilon,\ n = 1, 2, \ldots).$$

Use independence, then identical distribution of the $\{X_n\}$ to get

$$P(A_k) \ge P(S_k \in I)\, P(|S_{n+k} - S_k| \ge 2\epsilon,\ n = 1, 2, \ldots)
= P(S_k \in I)\, P(|S_n| \ge 2\epsilon,\ n = 1, 2, \ldots).$$

This inequality holds for all $k \ge 1$. Thus

$$P(S_n \in I \text{ finitely often}) \ge P(|S_n| \ge 2\epsilon,\ n = 1, 2, \ldots) \sum_{k=1}^\infty P(S_k \in I).$$

Since $\sum_1^\infty P(S_k \in I) = \infty$, we conclude that for every $\epsilon > 0$

(3.35)   $$P(|S_n| \ge 2\epsilon,\ n = 1, 2, \ldots) = 0.$$


Now take $I = (-\epsilon, +\epsilon)$, and define the sets $A_k$ as above. Denote $I_\delta =
(-\delta, +\delta)$, so that

$$A_k = \lim_{\delta \uparrow \epsilon} \{S_k \in I_\delta,\ S_{n+k} \notin I,\ n = 1, 2, \ldots\}, \quad k \ge 1.$$

Since the sequence of sets is monotone,

$$P(A_k) = \lim_{\delta \uparrow \epsilon} P(S_k \in I_\delta,\ S_{n+k} \notin I,\ n = 1, 2, \ldots), \quad k \ge 1.$$

Now use (3.35) in getting

$$P(S_k \in I_\delta,\ S_{n+k} \notin I,\ n = 1, 2, \ldots)
\le P(S_k \in I_\delta,\ |S_{n+k} - S_k| \ge \epsilon - \delta,\ n = 1, \ldots)
= P(S_k \in I_\delta)\, P(|S_n| \ge \epsilon - \delta,\ n = 1, \ldots) = 0$$

to conclude $P(A_k) = 0$, $k \ge 1$. Use (3.35) directly to establish

$$P(A_0) = P(|S_n| \ge \epsilon,\ n = 1, 2, \ldots) = 0.$$

Therefore $P(S_n \in I \text{ finitely often}) = 0$, and the sums $S_n$ enter every
neighborhood of the origin infinitely often with probability one. So the origin is a
recurrent state, and consequently all states in $L_d$ are recurrent. Q.E.D.


Look again at the statement of this theorem. An immediate application 
of the Borel-Cantelli lemma gives a zero-one property: 


Corollary 3.36. Either

$$P(S_n \in I \text{ i.o.}) = 1$$

for all finite intervals $I$ such that $L_d \cap I \ne \emptyset$; or

$$P(S_n \in I \text{ i.o.}) = 0$$

for all such $I$.


Definition 3.37. If the first alternative in 3.36 holds, call the process $S_1, S_2, \ldots$
recurrent. If the second holds, call it transient.


A quick corollary of 3.34 is a proof that fair coin-tossing is recurrent.
Simply use the estimate $P(S_{2n} = 0) \sim 1/\sqrt{\pi n}$ to deduce that

$$\sum_{n=1}^\infty P(S_{2n} = 0)$$

diverges.
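
As a numerical illustration of this estimate, the sketch below compares the exact value $P(S_{2n} = 0) = \binom{2n}{n} 4^{-n}$ with $1/\sqrt{\pi n}$ and prints partial sums of the series, which keep growing roughly like $2\sqrt{n/\pi}$. The function names are illustrative choices, not notation from the text, and a finite computation can only suggest divergence, not prove it.

```python
import math

def return_prob(n: int) -> float:
    """Exact P(S_{2n} = 0) = C(2n, n) / 4^n for fair coin-tossing."""
    return math.comb(2 * n, n) / 4 ** n

def stirling_estimate(n: int) -> float:
    """The estimate 1 / sqrt(pi * n) quoted in the text."""
    return 1.0 / math.sqrt(math.pi * n)

def partial_sum(n_max: int) -> float:
    """Sum of P(S_{2n} = 0) for n = 1, ..., n_max."""
    total, u = 0.0, 1.0
    for n in range(1, n_max + 1):
        u *= (2 * n - 1) / (2 * n)   # u_n = u_{n-1} * (2n - 1) / (2n)
        total += u
    return total

for n in (10, 100, 1000):
    print(n, return_prob(n), stirling_estimate(n), partial_sum(n))
```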

The criterion for recurrence given in 3.34 is difficult to check directly
in terms of the distribution of $X_1, X_2, \ldots$ A slightly more workable expression
will be developed in Chapter 8. There is one important general result,
however. If

$$E|X_1| < \infty \quad \text{and} \quad EX_1 \ne 0,$$

then by the law of large numbers, the sums are transient. If $EX_1 = 0$, the
issue is in doubt. All that is known is that $S_n = o(n)$ [$o(n)$ denoting small
order of $n$]. There is no reason why the sums should behave as regularly as
the sums in coin-tossing. (See Problem 17 for a particularly badly-behaved
example of successive sums with zero means.) But, at any rate,


Theorem 3.38. If $EX_1 = 0$, then the sums $S_1, S_2, \ldots$ are recurrent.
Proof. First, we need to prove 


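Proposition 3.39. If $I$ is any interval of length $a$, then

$$\sum_{n=1}^\infty P(S_n \in I) \le 1 + \sum_{n=1}^\infty P(|S_n| < a).$$

Proof. Denote $N = \sum_1^\infty \chi_I(S_n)$, so that $N$ counts the number of times that
the sums enter the interval $I$. Define an extended random variable $n^*$ by

$$n^* = \begin{cases} \text{first } n \text{ such that } S_n \in I, \\ \infty \text{ if } S_n \notin I \text{ for all } n. \end{cases}$$

Then

$$EN = \sum_{k=1}^\infty \int_{\{n^* = k\}} N \, dP.$$

On $\{n^* = k\}$, $\chi_I(S_n) = 0$, $n < k$, and $\chi_I(S_k) = 1$. Thus, on $\{n^* = k\}$,
denoting by $I - y$ the interval $I$ shifted left a distance $y$,

$$N \le 1 + \sum_{n=1}^\infty \chi_{I - S_k}(S_{n+k} - S_k)
\le 1 + \sum_{n=1}^\infty \chi_{(-a, a)}(S_{n+k} - S_k).$$

Since $\{n^* = k\} \in \mathcal{F}(X_1, \ldots, X_k)$,

$$EN \le P(n^* < \infty)\left[1 + \sum_{n=1}^\infty P(|S_n| < a)\right].$$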
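We use 3.39 to prove 3.38 as follows: For any positive integer $M$,

$$\sum_{n=1}^\infty P(-M \le S_n < M) = \sum_{k=-M}^{M-1} \sum_{n=1}^\infty P(S_n \in [k, k+1))
\le 2M\left(1 + \sum_{n=1}^\infty P(|S_n| < 1)\right).$$

Hence

(3.40)   $$\limsup_{M} \frac{1}{2M} \sum_{n=1}^\infty P(|S_n| < M) \le 1 + \sum_{n=1}^\infty P(|S_n| < 1).$$

The strong law of large numbers implies the weaker result $S_n/n \to 0$ in probability, or
$P(|S_n| < \epsilon n) \to 1$ for every $\epsilon > 0$. Fix $\epsilon$, and take $m$ so that $P(|S_n| < \epsilon n) > \tfrac12$,
$n > m$. Then $P(|S_n| < M) > \tfrac12$, $m < n \le M/\epsilon$, which gives

$$\sum_{n=1}^\infty P(|S_n| < M) \ge \tfrac12 (M/\epsilon - m),$$
$$\frac{1}{2M} \sum_{n=1}^\infty P(|S_n| < M) \ge \frac{1}{4\epsilon} - \frac{m}{4M}.$$

Substituting this into (3.40), we get

$$1 + \sum_{n=1}^\infty P(|S_n| < 1) \ge \frac{1}{4\epsilon}.$$

Since $\epsilon$ is arbitrary, conclude that

$$\sum_{n=1}^\infty P(|S_n| < 1) = \infty.$$

By 3.34 the sums are recurrent. Q.E.D.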


Problems 


17. Unfavorable "fair" game, due to Feller [59, Vol. I, p. 246]. Let $X_1, X_2, \ldots$
be independent and identically distributed, and take values in $\{0, 2, 2^2, 2^3, \ldots\}$
so that

$$P(X_1 = 2^k) = \frac{1}{2^k k(k+1)},$$

and define $P(X_1 = 0)$ to make the sum unity. Now $EX_1 = 1$, but show that
for every $\epsilon > 0$,

$$P\left(S_n - n < -\frac{(1 - \epsilon)n}{\log_2 n}\right) \to 1.$$

18. Consider $k$ fair coin-tossing games being carried on independently of
each other, giving rise to the sequence of random variables

$$Y_1^{(1)}, Y_2^{(1)}, \ldots$$
$$Y_1^{(2)}, Y_2^{(2)}, \ldots$$
$$\vdots$$
$$Y_1^{(k)}, Y_2^{(k)}, \ldots$$

where $Y_n^{(j)}$ is $\pm 1$ as the $n$th outcome of the $j$th game is H or T. Let $Z_n^{(j)} =
Y_1^{(j)} + \cdots + Y_n^{(j)}$, and plot the progress of each game on one axis of
$R^{(k)}$. The point described is $Z_n = (Z_n^{(1)}, \ldots, Z_n^{(k)}) = Y_1 + \cdots + Y_n$,
where $Y_1$ takes any one of the values $(\pm 1, \ldots, \pm 1)$ with probability $1/2^k$.
Denote $0 = (0, 0, \ldots, 0)$. Now $Z_n = 0$ only if equalization takes place in
all $k$ games simultaneously. Show that

$$P(Z_n = 0 \text{ i.o.}) = \begin{cases} 1, & k = 1, 2, \\ 0, & k \ge 3. \end{cases}$$
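
For orientation on Problem 18 (a numerical hint, not a solution), note that by independence of the games $P(Z_{2n} = 0) = [P(S_{2n} = 0)]^k$. The sketch below, with an illustrative function name, prints partial sums of this series: for $k = 1, 2$ they keep growing, while for $k = 3$ they level off, which is the dichotomy the Borel-Cantelli lemma and a criterion like 3.34 turn into the stated result.

```python
def equalization_series(k: int, n_max: int) -> float:
    """Partial sum over n = 1..n_max of P(Z_{2n} = 0) = (C(2n, n) / 4^n)^k."""
    total, u = 0.0, 1.0
    for n in range(1, n_max + 1):
        u *= (2 * n - 1) / (2 * n)   # u = P(a single game is equalized at time 2n)
        total += u ** k              # independence across the k games
    return total

for k in (1, 2, 3):
    print(k, [round(equalization_series(k, n), 3) for n in (100, 1000, 10000)])
```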


8. STOPPING TIMES AND EQUIDISTRIBUTION OF SUMS 


Among the many nice applications of the law of large numbers, I am going
to pick one. Suppose that $X_1, X_2, \ldots$ are distributed on the lattice $L_d$ and
the sums are recurrent. For any interval $I$ such that $I \cap L_d \ne \emptyset$, the number
of $S_1, \ldots, S_n$ falling into $I$ goes to $\infty$. Denote this number by $N_n(I)$. Then
$N_n/n$, the average number of landings in $I$ per unit time, goes to zero in
general (see Problem 20).

An interesting result is that the points $S_1, \ldots, S_n$ become a.s. uniformly
distributed in the sense that for any two finite intervals $I_1$, $I_2$,

$$\frac{N_n(I_1)}{N_n(I_2)} \to \frac{\|I_1\|}{\|I_2\|} \quad \text{a.s.}$$

(In the lattice case, define $\|I\|$ as the number of points of $L_d$ in $I$.)
This equidistribution is clearly a strong property. The general proof is not
elementary (see Harris and Robbins [69]). But there is an interesting proof
in the lattice case which introduces some useful concepts. The idea is to
look at the number of landings of the $S_n$ sequence in $I$ between successive
zeros of $S_n$.


Definition 3.41. A positive, integer-valued random variable $n^*$ is called a
stopping time for the sums $S_1, S_2, \ldots$ if

$$\{n^* = k\} \in \mathcal{F}(S_1, \ldots, S_k).$$

The field of events $\mathcal{F}(S_k, k \le n^*)$ depending on $S_n$ up to the time of stopping consists
of all $A \in \mathcal{F}(X)$ such that

$$A \cap \{n^* = k\} \in \mathcal{F}(S_1, \ldots, S_k).$$


For example, in the recurrent case, $d = 1$, let $n^*$ be the first entry of the sums
$\{S_n\}$ into the state $j$. This is a stopping time. More important, once at state $j$,
the process continues by adding independent random variables, so that
$S_{n^*+k} - S_{n^*}$ should have the same distribution as $S_k$ and be independent of
anything that happened up to time $n^*$.


Proposition 3.42. If $n^*$ is a stopping time, then the process $\tilde{S}_k = S_{n^*+k} - S_{n^*}$,
$k = 1, \ldots$ has the same distribution as $S_k$, $k = 1, \ldots$ and $\mathcal{F}(\tilde{S}_1, \tilde{S}_2, \ldots)$ is
independent of $\mathcal{F}(S_k, k \le n^*)$.


Proof. Let $A \in \mathcal{F}(S_k, k \le n^*)$, $B \in \mathcal{B}_\infty$, and write

$$P(\tilde{S} \in B, A) = \sum_{n=1}^\infty P(\tilde{S} \in B, A, n^* = n).$$

On the set $\{n^* = n\}$, the $\tilde{S}_k$ process is equal to the $S_{k+n} - S_n$ process, and
$A \cap \{n^* = n\} \in \mathcal{F}(S_1, \ldots, S_n)$. Since $S_{k+n} - S_n$, $k = 1, 2, \ldots$ has the same
distribution as $S_k$, $k = 1, 2, \ldots$ and is independent of $\mathcal{F}(S_1, \ldots, S_n)$,

$$P(\tilde{S} \in B, A) = P(S \in B) \sum_{n=1}^\infty P(A, n^* = n) = P(S \in B) P(A).$$

Note that $n^*$ itself is measurable $\mathcal{F}(S_k, k \le n^*)$.

Definition 3.43. The times of the zeros of $S_n$ are defined by

$$R_1 = \min\{n;\ S_n = 0,\ n > 0\},$$
$$R_2 = \min\{n;\ S_n = 0,\ n > R_1\},$$
$$\vdots$$
$$R_k = \min\{n;\ S_n = 0,\ n > R_{k-1}\}.$$

$R_k$ is called the $k$th occurrence time of $\{S_n = 0\}$. The times between zeros are
defined by $T_1 = R_1$, $T_2 = R_2 - R_1, \ldots$
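
To make Definition 3.43 concrete, here is a minimal sketch that extracts the occurrence times $R_k$ and the gaps $T_k$ from a finite list of steps. The function name `zero_times` and the sample path are illustrative assumptions; only zeros that occur within the given finite path are reported.

```python
from itertools import accumulate

def zero_times(steps):
    """Return (R, T): occurrence times of {S_n = 0} and the gaps between them."""
    sums = list(accumulate(steps))                    # S_1, S_2, ...
    R = [n for n, s in enumerate(sums, start=1) if s == 0]
    T = [r - prev for prev, r in zip([0] + R, R)]     # T_1 = R_1, T_k = R_k - R_{k-1}
    return R, T

steps = [+1, -1, -1, -1, +1, +1, +1, -1]              # partial sums: 1, 0, -1, -2, -1, 0, 1, 0
print(zero_times(steps))                              # ([2, 6, 8], [2, 4, 2])
```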


The usefulness of these random variables is partially accounted for by 


Proposition 3.44. If $P(S_n = 0 \text{ i.o.}) = 1$, then the $T_1, T_2, \ldots$ are independent
and identically distributed random variables.


Proof. $T_1$ is certainly a stopping time. By 3.42, $\tilde{S}_k = S_{k+T_1} - S_{T_1}$ has the
same distribution as $S_k$, but this process is independent of $T_1$. Thus, $T_2$,
which is the first equalization time for the $\tilde{S}_k$ process, is independent of
$T_1$ and has the same distribution. Repeat this argument for $k = 3, \ldots$.


Theorem 3.45. Let $X_1, X_2, \ldots$ have lattice distance one, and $P(S_n = 0 \text{ i.o.}) = 1$.
Then, for any two states $j$, $l$,

$$\frac{N_n(j)}{N_n(l)} \to 1 \quad \text{a.s.}$$

Proof. Let $R_1, R_2, \ldots$ be the times of successive returns to the origin,
$T_1, T_2, \ldots$ the times between returns. The $T_1, T_2, \ldots$ are independent and
identically distributed. Let $M_1(j)$ be the number of landings in $j$ before the
first return to the origin, $M_2(j)$ the number between the first and second
returns, etc. The $M_1, M_2, \ldots$ are similarly independent and identically
distributed (see Problem 22). The law of large numbers could be applied to
$(M_1 + \cdots + M_k)/k$ if we knew something about $EM_1(j)$. Denote $\pi(j) =
EM_1(j)$, and assume for the moment that for all $j \in L_1$, $0 < \pi(j) < \infty$.
Since $M_1(j) + \cdots + M_k(j) = N_{R_k}(j)$,

$$\frac{N_{R_k}(j)}{N_{R_k}(l)} \to \frac{\pi(j)}{\pi(l)} \quad \text{a.s.}$$

This gives convergence of $N_n(j)/N_n(l)$ along a random subsequence. To get
convergence over the full sequence, write

$$\max_{0 < n \le T_{k+1}} \left| \frac{N_{R_k+n}(j)}{N_{R_k+n}(l)} - \frac{N_{R_k}(j)}{N_{R_k}(l)} \right|
\le \frac{N_{R_{k+1}}(j) - N_{R_k}(j)}{N_{R_k}(l)}
+ \frac{N_{R_k}(j)\,[N_{R_{k+1}}(l) - N_{R_k}(l)]}{[N_{R_k}(l)]^2}, \qquad j \ne 0,\ l \ne 0.$$

Dividing the top and bottom of the first term on the right by $k$ we have that
$\lim_k$ of that term is given by

$$\frac{1}{EM_1(l)} \lim_k \frac{M_{k+1}(j)}{k}.$$

By Problem 10, this term is a.s. zero. The second term is treated similarly to
get

(3.46)   $$\lim_n \frac{N_n(j)}{N_n(l)} = \frac{\pi(j)}{\pi(l)} \quad \text{a.s.}$$

Given that we have landed in $j$ for the first time, let $\lambda_j$ be the probability,
starting from $j$, that we return to the origin before another landing in $j$.
This must occur with positive probability, otherwise $P(S_n = 0 \text{ i.o.}) < 1$.
So whether or not another landing occurs before return is decided by tossing
a coin with probability $\lambda_j$ of failure. The expected number of additional
landings past the first is given by the expected number of trials until failure.
This is given by $\sum_{m=1}^\infty m(1 - \lambda_j)^m \lambda_j < \infty$, hence $\pi(j)$ is finite. Add the
convention $\pi(0) = 1$, then (3.46) holds for all states, $j$ and $l$. Let $n^*$ be the first
time that state $l$ is entered. By (3.46)

$$\frac{N_{n^*+n}(j)}{N_{n^*+n}(l)} \to \frac{\pi(j)}{\pi(l)} \quad \text{a.s.}$$

But $N_{n^*+n}(l)$ is the number of times that the sums $S_{k+n^*} - S_{n^*}$, $k = 1, \ldots, n$
land in state zero (plus the landing at time $n^*$ itself), and $N_{n^*+n}(j)$ is the number
of times that $S_{k+n^*} - S_{n^*}$, $k = 1, \ldots, n$ land in $j - l$, plus the number of landings
in $j$ by the sums $S_k$ up to time $n^*$. Therefore, $\pi(j)/\pi(l) = \pi(j - l)/\pi(0)$, or

$$\pi(j) = \pi(l)\pi(j - l).$$

This is the exponential equation on the integers; the only solutions are
$\pi(j) = r^j$. Consider any sequence of states $m_1, \ldots, m_{n-1}, 0$ terminating at
zero. I assert that

$$P(S_1 = m_1, \ldots, S_{n-1} = m_{n-1}, S_n = 0)
= P(S_1 = -m_{n-1}, S_2 = -m_{n-2}, \ldots, S_{n-1} = -m_1, S_n = 0).$$

The first probability is $P(X_1 = m_1, X_2 = m_2 - m_1, \ldots, X_n = -m_{n-1})$. The fact that
$X_1, \ldots, X_n$ are independent and identically distributed lets us equate this to

$$P(X_1 = -m_{n-1}, \ldots, X_n = m_1),$$

which is the second probability. This implies that $\pi(j) = \pi(-j)$, hence
$\pi(j) = 1$.
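
The theorem is an almost-sure statement, but its content can be previewed at the level of expectations. The following sketch, an illustration under the assumption of the simple symmetric walk (steps $\pm 1$ with probability $1/2$), computes $EN_n(j) = \sum_{m \le n} P(S_m = j)$ exactly and shows the ratio $EN_n(3)/EN_n(0)$ approaching one. The helper names are illustrative, and this expectation-level check does not replace the proof above.

```python
import math

def prob_at(m: int, j: int) -> float:
    """P(S_m = j) for the simple symmetric walk (steps +1/-1 with probability 1/2)."""
    if (m + j) % 2 or abs(j) > m:
        return 0.0
    up = (m + j) // 2                                  # number of +1 steps needed
    log_p = (math.lgamma(m + 1) - math.lgamma(up + 1)
             - math.lgamma(m - up + 1) - m * math.log(2.0))
    return math.exp(log_p)

def expected_visits(j: int, n: int) -> float:
    """E N_n(j) = sum over m = 1..n of P(S_m = j)."""
    return sum(prob_at(m, j) for m in range(1, n + 1))

for n in (100, 10_000, 100_000):
    print(n, expected_visits(3, n) / expected_visits(0, n))
```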
Problems


19. For fair coin-tossing equalizations $P(T_1 = 2n) = \frac{1}{n}\binom{2n-2}{n-1} 2^{-2n+1}$.
Use this result to show that

a) $P(T_1 = 2n) \sim \dfrac{1}{2\sqrt{\pi}\, n^{3/2}}$.

b) From (a) conclude $ET_1 = \infty$.

c) Using (a) again, show that $P(T_1 > 2n) \sim \dfrac{1}{\sqrt{\pi n}}$.

d) Use (c) to show that $P(T_k > 2k^2 \text{ i.o.}) = 1$.

e) Conclude from (d) that $P\left(\limsup_n \dfrac{R_n}{n} = \infty\right) = 1$.

(There are a number of ways of deriving the exact expression above for
$P(T_1 = 2n)$; see Feller [59, Vol. I, pp. 74-75].)

20. For fair coin-tossing, use $P(S_{2n} = 0) \sim 1/\sqrt{\pi n}$ to show that

$$E[N_n(0)]^2 \sim cn.$$

Use an argument similar to the proof of the strong law of large numbers for
fair coin-tossing, Theorem 1.21, to prove

$$\frac{N_n(0)}{n} \to 0 \quad \text{a.s.}$$

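21. Define

$$n_1^* = \min\{n;\ S_n > 0\},$$
$$n_2^* = \min\{n;\ S_{n+n_1^*} > S_{n_1^*}\} \text{ and so forth,}$$

$$Y_k = \begin{cases} S_{n_1^* + \cdots + n_k^*} - S_{n_1^* + \cdots + n_{k-1}^*}, & k > 1, \\ S_{n_1^*}, & k = 1. \end{cases}$$

Show that the sequence $(n_k^*, Y_k)$ are independent, identically distributed
vectors. Use the law of large numbers to prove that if $E|X_1| < \infty$, $EX_1 > 0$,
then if one of $EY_1$, $En_1^*$ is finite, so is the other, and

$$(En_1^*)(EX_1) = EY_1.$$

Show by using the sequence

$$\tilde{X}_k = \begin{cases} X_k, & X_k \le \lambda, \\ 0, & X_k > \lambda, \end{cases}$$

for $\lambda > 0$, that $EY_1 < \infty$, $En_1^* < \infty$. (See Blackwell [7].)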




22. For sums $S_1, S_2, \ldots$ such that $P(S_n = 0 \text{ i.o.}) = 1$, and $R_1, R_2, \ldots$ the
occurrence times of $\{S_n = 0\}$, define the vectors $Z_k$ by

$$Z_k = (S_{R_{k-1}+1}, \ldots, S_{R_k}).$$

Define the appropriate range space $(R, \mathcal{B})$ for each of these vectors and show
that they are independent and identically distributed.


9. HEWITT-SAVAGE ZERO-ONE LAW 


Section 7 proved that for any interval $I$, $P(S_n \in I \text{ i.o.})$ is zero or one. But
these sets are not tail events; whether $S_n = 0$ an infinite number of times
depends strongly on $X_1$. However, there is another zero-one law in operation,
formulated recently by Hewitt and Savage [71], which covers a variety of
non-tail events.


Definition 3.47. For a process $X_1, X_2, \ldots$, $A \in \mathcal{F}(X)$ is said to be symmetric
if for any finite permutation $\{i_1, i_2, \ldots\}$ of $\{1, 2, \ldots\}$, there is a set $B \in \mathcal{B}_\infty$
such that

$$A = \{(X_1, X_2, \ldots) \in B\} = \{(X_{i_1}, X_{i_2}, \ldots) \in B\}.$$


Theorem 3.48. For $X_1, X_2, \ldots$ independent and identically distributed, every
symmetric set has probability zero or one.


Proof. The short proof we give here is due to Feller [59, Vol. II]. Take
$A_n \in \mathcal{F}(X_1, \ldots, X_n)$ so that $P(A \,\triangle\, A_n) \to 0$. $A_n$ can be written

$$A_n = \{(X_1, \ldots, X_n) \in B_n\}, \quad B_n \in \mathcal{B}_n.$$

Because the $\tilde{X} = (X_{i_1}, X_{i_2}, \ldots)$ process, $i_1, i_2, \ldots$ any sequence of distinct
integers, has the same distribution as $X$,

$$P(\tilde{X} \in C) = P(X \in C), \quad C \in \mathcal{B}_\infty.$$

Hence, for any $B \in \mathcal{B}_\infty$,

(3.49)   $$P(\{\tilde{X} \in B\} \,\triangle\, \{\tilde{X} \in B_n\}) = P(\{X \in B\} \,\triangle\, \{X \in B_n\}).$$

Take $(i_1, i_2, \ldots) = (2n, 2n - 1, \ldots, 1, 2n + 1, \ldots)$. Then

$$\tilde{A}_n = \{\tilde{X} \in B_n\} = \{(X_{2n}, \ldots, X_{n+1}) \in B_n\}.$$

By (3.49), taking $B \in \mathcal{B}_\infty$ such that $\{\tilde{X} \in B\} = \{X \in B\} = A$,

$$P(A \,\triangle\, \tilde{A}_n) = P(A \,\triangle\, A_n) \to 0 \Rightarrow P(A_n \,\triangle\, \tilde{A}_n) \to 0 \Rightarrow P(A_n \cap \tilde{A}_n) \to P(A).$$

But $A_n$ and $\tilde{A}_n$ are independent, thus

$$P(A_n \cap \tilde{A}_n) = P(A_n) P(\tilde{A}_n) \to P(A)^2.$$

Again, as in the Kolmogorov zero-one law, we wind up with

$$P(A) = P(A)^2. \quad \text{Q.E.D.}$$

Corollary 3.50. For $X_1, X_2, \ldots$ independent and identically distributed, $\{S_n\}$
the sums, every tail event on the $S_1, S_2, \ldots$ process has probability zero or one.


Proof. For $A$ a tail event, if $\{i_1, i_2, \ldots\}$ permutes only the first $n$ indices,
take $B \in \mathcal{B}_\infty$ such that

$$A = \{(S_{n+1}, \ldots) \in B\}.$$

Thus $A$ is a symmetric set.

This result leads to the mention of another famous strong limit theorem
which will be proved much later. If $EX_1 = 0$, $EX_1^2 < \infty$ for independent,
identically distributed random variables, then the form 3.27 of the strong
law of large numbers implies

$$\frac{X_1 + \cdots + X_n}{\sqrt{n} \log n} \to 0 \quad \text{a.s.}$$

On the other hand, it is not hard to show that

(3.51)   $$\limsup_n \frac{X_1 + \cdots + X_n}{\sqrt{n}} = +\infty \quad \text{a.s.}$$

Therefore fluctuations of $S_n$ should be somewhere in the range $\sqrt{n}$ to
$\sqrt{n} \log n$. For any function $h(n) \uparrow \infty$, the random variable $\limsup |S_n|/h(n)$ is a
tail random variable, hence a.s. constant. The famous law of the iterated
logarithm is


Theorem 3.52.

$$\limsup_n \frac{|S_n|}{\sigma\sqrt{2n \log(\log n)}} = 1 \quad \text{a.s.}$$

This is equivalent to: For every $\epsilon > 0$,

$$P(|S_n| \ge (1 - \epsilon)h(n) \text{ i.o.}) = 1,$$
$$P(|S_n| \ge (1 + \epsilon)h(n) \text{ i.o.}) = 0,$$

with $h(n) = \sigma\sqrt{2n \log(\log n)}$, $\sigma^2 = EX_1^2$.
Therefore, a more general version of the law of the iterated logarithm
would be a separation of all nondecreasing $h(n)$ into two classes

$$P(|S_n| \ge h(n) \text{ i.o.}) = 0 \text{ or } 1.$$

The latter dichotomy holds because of 3.50. The proof of 3.52 is quite
tricky, to say nothing of the more general version. The simplest proof
around for coin-tossing is in Feller [59, Vol. I].
Actually, this theorem is an oddity: even though it is a strong
limit theorem, it is a second-moment theorem, and its proof consists of
ingenious uses of the Borel-Cantelli lemma combined with the central limit
theorem. We give an illuminating proof in Chapter 13.


Problem 23. Use the Kolmogorov zero-one law and the central limit theorem 
to prove (3.51) for fair coin-tossing. 


Remark. The important theorems for independence come up over and
over again as their contents are generalized. In particular, the random signs
problem connects with martingales (Chapter 5), the strong law of large
numbers generalizes into the ergodic theorem (Chapter 6), and the notion of
recurrence of sums comes up again in Markov processes (Chapter 7).


NOTES 


The strong law of large numbers was proven for fair coin-tossing by Borel 
in 1909. The forms of the strong law given in this chapter were proved by 
Kolmogorov in 1930, 1933 [92], [98]. The general solution of the random 
signs problem is due to Khintchine and Kolmogorov [91] in 1925. A special
case of the law of the iterated logarithm was proven by Khintchine [88,
1924].

The work on recurrence is more contemporary. The theorems of Section
7 are due to Chung and Fuchs [18, 1951], but the neat proof given that
$EX_1 = 0$ implies recurrence was found by Chung and Ornstein [20, 1962].
But this work was preceded by some intriguing examples due to Polya
[116, 1921]. (These will be given in Chapter 7.) The work of Harris and
Robbins (loc. cit.) on the equidistribution of sums appeared in 1953.

There is a bound for sums of independent random variables with zero
means which is much better known than Skorokhod's inequality,
that is,

$$P\left(\max_{k \le n} |S_k| \ge \epsilon\right) \le \frac{1}{\epsilon^2} \sum_{k=1}^n EX_k^2,$$

where $S_k = X_1 + \cdots + X_k$. This is due to Kolmogorov. Compare it with the
Chebyshev bound for $P(|S_n| \ge \epsilon)$. A generalization is proved in Chapter 5.
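
A small exact check of Kolmogorov's bound is possible for fair coin-tossing, where $EX_k^2 = 1$ and all $2^n$ sign patterns are equally likely. The sketch below enumerates them for a small $n$; the function name and the choice $n = 12$ are illustrative assumptions, and the enumeration is only feasible for short paths.

```python
from itertools import product

def max_prob(n: int, eps: float) -> float:
    """Exact P(max_{k<=n} |S_k| >= eps) for n fair +1/-1 steps, by enumeration."""
    hits = 0
    for steps in product((1, -1), repeat=n):
        s, peak = 0, 0
        for x in steps:
            s += x
            peak = max(peak, abs(s))   # running maximum of |S_k|
        hits += peak >= eps
    return hits / 2 ** n

n = 12
for eps in (2, 4, 6):
    kolmogorov = n / eps ** 2          # (1/eps^2) * sum of E X_k^2, with E X_k^2 = 1
    print(eps, max_prob(n, eps), min(1.0, kolmogorov))
```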

The strong law for identically distributed random variables depends
essentially on $E|X_1| < \infty$. One might expect that even if $E|X_1| = \infty$, there
would be another normalization $N_n \uparrow \infty$ such that the normed sums

$$\frac{X_1 + \cdots + X_n}{N_n}$$

converge a.s. One answer is trivial; you can always take $N_n$ increasing so
rapidly that a.s. convergence to zero follows. But Chow and Robbins [14]
have obtained the strong result that if $E|X_1| = \infty$, there is no normalization
$N_n \uparrow \infty$ such that one gets a.s. convergence to anything but zero.

If $E|X_1| < \infty$, then if the sums are nonrecurrent, either $S_n \to +\infty$ a.s.
or $S_n \to -\infty$ a.s. But if $E|X_1| = \infty$, the sums can be transient and still change
sign an infinite number of times; in fact, one can get $\limsup S_n = +\infty$ a.s.,
$\liminf S_n = -\infty$ a.s. Examples of this occur when $X_1, X_2, \ldots$ have one of the
symmetric stable distributions discussed in Chapter 9.

Strassen [135] has shown that the law of the iterated logarithm is a
second-moment theorem in the sense that if $EX_1 = 0$, and

$$\limsup_n \frac{|S_n|}{\sqrt{n \log(\log n)}} < \infty \quad \text{a.s.},$$

then $EX_1^2 < \infty$. There is some work on other forms of this law when
$EX_1^2 = \infty$, but the results (Feller [54]) are very specialized.

For more extensive work with independent random variables the most
interesting source remains Paul Lévy's book [103]. Loève's book has a good
deal of the classical material. For an elegant and interesting development of
the ideas of recurrence see Spitzer [130].


CHAPTER 4 


CONDITIONAL PROBABILITY 
AND CONDITIONAL EXPECTATION 


1. INTRODUCTION 


More general tools need to be developed to handle relationships between 
dependent random variables. The concept of conditional probability—the 
distribution of one set of random variables given information concerning 
the observed values of another set—will turn out to be a most useful tool. 

First consider the problem: What is the probability of an event $B$
given that $A$ has occurred? If we know that $\omega \in A$, then our new sample space
is $A$. The probability of $B$ is proportional to the probability of that part of
it lying in $A$. Hence


Definition 4.1. Given $(\Omega, \mathcal{F}, P)$, for sets $A, B \in \mathcal{F}$, such that $P(A) > 0$, the
conditional probability of $B$ given that $A$ has occurred is defined as

$$\frac{P(B \cap A)}{P(A)}$$

and is denoted by $P(B \mid A)$.


This extends immediately to conditioning by random variables taking only a 
countable number of values. 


Definition 4.2. If $X$ takes values in $\{x_k\}$, the conditional probability of $A$ given
$X = x_k$ is defined by

$$P(A \mid X = x_k) = \frac{P(A, X = x_k)}{P(X = x_k)}$$

if $P(X = x_k) > 0$ and arbitrarily defined as zero if $P(X = x_k) = 0$.


Note that there is probability zero that $X$ takes values in the set where the
conditional probability was not defined by the ratio. $P(A \mid X = x_k)$ is a
probability on $\mathcal{F}$, and the natural definition of the conditional expectation
of a random variable $Y$ given $X = x_k$ is

(4.3)   $$E(Y \mid X = x_k) = \int Y(\omega)\, P(d\omega \mid X = x_k)$$

if the integral exists.
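
On a finite sample space, Definition 4.2 and (4.3) reduce to sums, which the following sketch evaluates with exact fractions. The two-dice space and the names `cond_prob` and `cond_exp` are hypothetical choices made only for illustration.

```python
from fractions import Fraction

# Hypothetical finite sample space: two fair dice, each outcome has probability 1/36.
omega = [(a, b) for a in range(1, 7) for b in range(1, 7)]
P = {w: Fraction(1, 36) for w in omega}

def X(w):            # conditioning variable: the first die
    return w[0]

def Y(w):            # variable whose conditional expectation we want: the total
    return w[0] + w[1]

def cond_prob(A, x):
    """P(A | X = x) as in Definition 4.2; zero by convention if P(X = x) = 0."""
    p_x = sum(P[w] for w in omega if X(w) == x)
    if p_x == 0:
        return Fraction(0)
    return sum(P[w] for w in omega if X(w) == x and A(w)) / p_x

def cond_exp(x):
    """E(Y | X = x) as in (4.3): integrate Y against P(. | X = x)."""
    return sum(Y(w) * cond_prob(lambda v, w=w: v == w, x) for w in omega)

print(cond_prob(lambda w: Y(w) >= 10, 6))   # 1/2
print(cond_exp(3))                          # 13/2
```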
What needs to be done is to generalize the definition so as to be able to
handle random variables taking on nondenumerably many values. Look at the
simplest case of this: Suppose there is a random variable $X$ on $(\Omega, \mathcal{F}, P)$,
and let $A \in \mathcal{F}$. If $B \in \mathcal{B}_1$ is such that $P(X \in B) > 0$, then as above, the
conditional probability of $A$ given $X \in B$ is defined by

$$P(A \mid X \in B) = \frac{P(A, X \in B)}{P(X \in B)}.$$


But suppose we want to give meaning to the conditional probability of A 
given X(w) = x. Of course, if P(X = x) > 0, then we have no trouble and 
proceed as in 4.2. But many of the interesting random variables have the 
property that $P(X = x) = 0$ for all $x$. This causes a fuss. An obvious thing
to try is taking limits, i.e., to try defining

(4.4)   $$P(A \mid X = x) = \lim_{h \downarrow 0} \frac{P(A, X \in (x - h, x + h))}{P(X \in (x - h, x + h))}.$$

In general, this is no good. If $P(X = x_0) = 0$, then there is no guarantee, unless
we put more restrictive conditions on $P$ and $X$, that the limit above will exist for
$x = x_0$.


So either we add these restrictions (very unpleasant), or we look at the 
problem a different way. Look at the limit in (4.4) globally as a function of x. 
Intuitively, it looks as though we are trying to take the derivative of one 
measure with respect to another. This has a familiar ring; we look back to 
see what can be done. 

On $\mathcal{B}_1$ define two measures as follows: Let

(4.5)   $$\hat{Q}(B) = P(A, X \in B), \qquad \hat{P}(B) = P(X \in B).$$

Note that $0 \le \hat{Q}(B) \le \hat{P}(B)$ so that $\hat{Q}$ is absolutely continuous with respect
to $\hat{P}$. By the Radon-Nikodym theorem (Appendix A.30) we can define the
derivative of $\hat{Q}$ with respect to $\hat{P}$, which is exactly what we are trying to do
with limits in (4.4). But we must pay a price for taking this elegant route.
Namely, recall that $d\hat{Q}/d\hat{P}$ is defined as any $\mathcal{B}_1$-measurable function $\varphi(x)$
satisfying

(4.6)   $$\hat{Q}(B) = \int_B \varphi(x)\, \hat{P}(dx), \quad \text{all } B \in \mathcal{B}_1.$$

If $\varphi$ satisfies (4.6) so does $\varphi'$ if $\varphi = \varphi'$ a.s. $\hat{P}$. Hence this approach, defining
$P(A \mid X = x)$ as any function satisfying (4.6), leads to an arbitrary selection
of one function from among a class of functions equivalent (a.s. equal) under
$\hat{P}$. This is a lesser evil.

Definition 4.7. The conditional probability $P(A \mid X = x)$ is defined as any
$\mathcal{B}_1$-measurable function satisfying

$$P(A, X \in B) = \int_B P(A \mid X = x)\, \hat{P}(dx), \quad \text{all } B \in \mathcal{B}_1.$$


In 4.7 above, $P(A \mid X = x)$ is defined as a $\mathcal{B}_1$-measurable function $\varphi(x)$,
unique up to equivalence under $\hat{P}$. For many purposes it is useful to consider
the conditional probability as a random variable on the original $(\Omega, \mathcal{F}, P)$
space, rather than the version above which resembles going into representation
space. The natural way to do this is to define

$$P(A \mid X(\omega)) = \varphi(X(\omega)).$$

Since $\varphi$ is $\mathcal{B}_1$-measurable, $\varphi(X(\omega))$ is a random variable on $(\Omega, \mathcal{F})$.
Since any two versions of $\varphi$ are equivalent under $\hat{P}$, any two versions of
$P(A \mid X(\omega))$ obtained in this way are equivalent under $P$. But there is a more
direct way to get to $P(A \mid X)$, analogous to 4.7. Actually, what is done is
just to transform 4.7 to $(\Omega, \mathcal{F}, P)$.


Definition 4.8. The conditional probability of $A$ given $X(\omega)$ is defined as any
random variable on $\Omega$, measurable $\mathcal{F}(X)$, and satisfying

$$P(A, X \in B) = \int_{\{X \in B\}} P(A \mid X)\, dP, \quad \text{all } B \in \mathcal{B}_1.$$

Any two versions of $P(A \mid X)$ differ on a set of probability zero.


This gives the same $P(A \mid X)$ as starting from 4.7 to get $\varphi(X(\omega))$, where
$\varphi(x) = P(A \mid X = x)$. To see this, apply 2.41 to 4.7 and compare the result
with 4.8. A proof that is a bit more interesting utilizes a converse of 2.31.


Proposition 4.9. Let $X$ be a random vector on $(\Omega, \mathcal{F})$ taking values in $(R, \mathcal{B})$.
If $Z$ is a random variable on $(\Omega, \mathcal{F})$, measurable $\mathcal{F}(X)$, then there is a random
variable $\theta(x)$ on $(R, \mathcal{B})$ such that

$$Z = \theta(X).$$

Proof. See Appendix A.21.


The fact that $P(A \mid X)$ is $\mathcal{F}(X)$-measurable implies by this proposition that
$P(A \mid X) = \theta(X)$, where $\theta(x)$ is $\mathcal{B}_1$-measurable. But $\theta(X)$ satisfies

$$P(A, X \in B) = \int_{\{X \in B\}} \theta(X(\omega))\, dP = \int_B \theta(x)\, \hat{P}(dx)$$

(this last by 2.41). Hence $\theta = \varphi$ a.s. $\hat{P}$.
We can put 4.8 into a form which shows up a seemingly curious phenomenon.
Since $\mathcal{F}(X)$ is the class of all sets $\{X \in B\}$, $B \in \mathcal{B}_1$, $P(A \mid X)$ is any
$\mathcal{F}(X)$-measurable random variable satisfying

(4.10)   $$P(A \cap D) = \int_D P(A \mid X)\, dP, \quad \text{all } D \in \mathcal{F}(X).$$

From this, make the observation that if $X_1$ and $X_2$ are two random variables
which contain the same information in the sense that $\mathcal{F}(X_1) = \mathcal{F}(X_2)$, then

(4.11)   $$P(A \mid X_1) = P(A \mid X_2) \quad \text{a.s.}$$

In a way this is not surprising, because $\mathcal{F}(X_1) = \mathcal{F}(X_2)$ implies that $X_1$
and $X_2$ are functions of each other, that is, from 4.9,

$$X_1(\omega) = \theta_1(X_2(\omega)), \qquad X_2(\omega) = \theta_2(X_1(\omega)).$$

The idea here is that $P(A \mid X)$ does not depend on the values of $X$, but rather
on the sets in $\mathcal{F}$ that $X$ discriminates between.

The same course can be followed in defining the conditional expectation of
one random variable, given the value of another. Let $X$, $Y$ be random variables
on $(\Omega, \mathcal{F}, P)$. What we wish to define is the conditional expectation of $Y$ given
$X = x$, in symbols, $E(Y \mid X = x)$. If $B \in \mathcal{B}_1$ were such that $P(X \in B) > 0$,
intuitively $E(Y \mid X \in B)$ should be defined as $\int Y(\omega) P(d\omega \mid X \in B)$, where
$P(\cdot \mid X \in B)$ is the probability on $\mathcal{F}$ defined as

$$P(A \mid X \in B) = P(A, X \in B)/P(X \in B).$$

Again, we could take $B = (x - h, x + h)$, let $h \downarrow 0$, and hope the limit
exists. More explicitly, we write the ratio

$$\frac{\int Y(\omega)\, P(d\omega, X \in B)}{P(X \in B)}$$

and hope that as $P(X \in B) \to 0$, the limiting ratio exists. Again the derivative
of one set function with respect to another is coming up. What to do is
similar: Define

$$\hat{Q}(B) = \int Y(\omega)\, P(d\omega, X \in B), \qquad \hat{P}(B) = P(X \in B).$$
To get things finite, we have to assume $E|Y| < \infty$; then

$$|\hat{Q}(B)| \le \int |Y|\, P(d\omega, X \in B) \le E|Y|.$$

To show that $\hat{Q}$ is $\sigma$-additive, write it as

$$\hat{Q}(B) = \int_{\{X \in B\}} Y(\omega)\, dP.$$

Now $\{B_n\}$ disjoint implies $A_n = \{X \in B_n\}$ disjoint, and

$$\hat{Q}(\cup B_n) = \int_{\cup A_n} Y\, dP = \sum_n \int_{A_n} Y\, dP = \sum_n \hat{Q}(B_n).$$

Also, $\hat{P}(B) = 0 \Rightarrow \hat{Q}(B) = 0$, thus $\hat{Q}$ is absolutely continuous with respect
to $\hat{P}$. This allows the definition of $E(Y \mid X = x)$ as any version of $d\hat{Q}/d\hat{P}$.


Definition 4.12. Let $E|Y| < \infty$; then $E(Y \mid X = x)$ is any $\mathcal{B}_1$-measurable
function satisfying

$$\int_B E(Y \mid X = x)\, d\hat{P}(x) = \int_{\{X \in B\}} Y(\omega)\, dP(\omega), \quad \text{all } B \in \mathcal{B}_1.$$

Any two versions of $E(Y \mid X = x)$ are a.s. equal under $\hat{P}$.


Conditional expectations can also be looked at as random variables.
Just as before, if $\varphi(x) = E(Y \mid X = x)$, $E(Y \mid X)$ can be defined as $\varphi(X(\omega))$.
Again, we prefer the direct definition.


Definition 4.13. Let $E|Y| < \infty$; then $E(Y \mid X)$ is any $\mathcal{F}(X)$-measurable function
satisfying

(4.14)   $$\int_A E(Y \mid X)\, dP = \int_A Y\, dP, \quad \text{all } A \in \mathcal{F}(X).$$


The random variable $Y$ trivially satisfies (4.14), but in general $Y \ne E(Y \mid X)$
because $Y(\omega)$ is not necessarily $\mathcal{F}(X)$-measurable. This remark does discover
the property that if $\mathcal{F}(Y) \subset \mathcal{F}(X)$ then $E(Y \mid X) = Y$ a.s. Another property
in this direction is: Consider the space of $\mathcal{F}(X)$-measurable random variables.
In this space, the random variable closest to $Y$ is $E(Y \mid X)$. (For a defined
version of this statement see Problem 11.)

Curiously enough, conditional probabilities are a special case of
conditional expectations, because, by the definitions,

$$P(A \mid X = x) = E(\chi_A(\omega) \mid X = x) \quad \text{a.s. } \hat{P},$$
$$P(A \mid X) = E(\chi_A(\omega) \mid X) \quad \text{a.s. } P.$$

Therefore, the next section deals with the general definition of conditional
expectation.


Definition 4.15. Random variables $X$, $Y$ on $(\Omega, \mathcal{F}, P)$ are said to have a joint
density if the probability $\hat{P}(\cdot)$ defined on $\mathcal{B}_2$ by $\hat{P}(F) = P((Y, X) \in F)$ is
absolutely continuous with respect to Lebesgue measure on $\mathcal{B}_2$, that is, if there
exists $f(y, x)$ on $R^{(2)}$, measurable $\mathcal{B}_2$, such that

$$P((Y, X) \in F) = \iint_F f(y, x)\, dy\, dx, \quad \text{all } F \in \mathcal{B}_2.$$

Then, by Fubini's theorem (see Appendix A.37), defining $f(x) = \int f(y, x)\, dy$,
for all $B \in \mathcal{B}_1$,

$$\hat{P}(B) = P(X \in B) = \int_B f(x)\, dx.$$

(Actually, any $\sigma$-finite product measure on $\mathcal{B}_2$ could be used instead of
$dy\, dx$.) If a joint density exists, then it can be used to compute the conditional
probability and expectation. This is the point of Problems 2 and 3 below.


Problems 


1. Let $X$ take on only integer values. Show that $P(A \mid X = x)$ as defined in
4.7 is any $\mathcal{B}_1$-measurable function $\varphi(x)$ satisfying

$$P(A, X = j) = \varphi(j) P(X = j), \quad \text{all } j.$$

Conclude that if $P(X = j) > 0$, then any version of the above conditional
probability satisfies

$$P(A \mid X = j) = \frac{P(A, X = j)}{P(X = j)}.$$


2. Prove that if $X$, $Y$ have a joint density, then for any $B \in \mathcal{B}_1$,

$$P(Y \in B \mid X = x) = \int_B \frac{f(y, x)}{f(x)}\, dy, \quad \text{a.s. } \hat{P}.$$


3. If $(Y, X)$ have a joint density $f(y, x)$ and $E|Y| < \infty$, show that one
version of $E(Y \mid X = x)$ is given by

$$\int y\, \frac{f(y, x)}{f(x)}\, dy.$$

4. Let $X_1$, $X_2$ take values in $\{1, 2, \ldots, N\}$. If $\mathcal{F}(X_1) = \mathcal{F}(X_2)$, then prove
there is a permutation $\{i_1, i_2, \ldots, i_N\}$ of $\{1, 2, \ldots, N\}$ such that

$$A_j = \{X_1 = j\} = \{X_2 = i_j\}.$$

Let $P(A_j) > 0$, $j = 1, \ldots, N$. Prove that

$$P(A \mid X_1) = P(A \mid X_2) = \sum_{j=1}^N \chi_{A_j}(\omega)\, P(A \mid A_j) \quad \text{a.s.}$$

5. Let $\Omega = [-1, +1]$, $\mathcal{F} = \mathcal{B}([-1, +1])$, $P = dz/2$, and $X_1(z) = z^2$.
Show that one version of $P(A \mid X_1)$ is

$$P(A \mid X_1) = \tfrac12 \chi_A(z) + \tfrac12 \chi_A(-z).$$

Let $X_2(z) = z^4$; find a version of $P(A \mid X_2)$. Find versions of $P(A \mid X_1 = x)$,
$P(A \mid X_2 = x)$.

6. Given the situation of Problem 5, and with $Y$ a random variable such that
$E|Y(z)| < \infty$, show that

$$E(Y(z) \mid X_1(z)) = \tfrac12 Y(z) + \tfrac12 Y(-z), \quad \text{a.s. } dz.$$

Find a version of

$$E(Y(z) \mid X_1(z) = x).$$


7. If $X$ and $Y$ are independent, show that for any $B \in \mathcal{B}_1$, $P(Y \in B \mid X) =
P(Y \in B)$ a.s. For $E|Y| < \infty$, show that $E(Y \mid X) = EY$ a.s.


8. Give an example to show that E(Y | X) = EY a.s. does not imply that 
X and Y are independent. 


9. (Borel paradox). Take $\Omega$ to be the unit sphere $S^{(2)}$ in $R^{(3)}$, $\mathcal{F}$ the Borel
subsets of $\Omega$, $P(\cdot)$ the extension of surface area. Choose two opposing points
on $S^{(2)}$ as the poles and fix a reference half-plane passing through them.
For any point $p$, define its longitude $\psi(p)$ as the angle between $-\pi$ and $\pi$
that the half-plane of the great semi-circle through $p$ makes with the reference
half-plane. Define its latitude $\theta(p)$ as the angle that the radius to $p$ makes with
the equatorial plane, $-\pi/2 \le \theta(p) \le \pi/2$. Prove that the conditional
probability of $\psi$ given $\theta$ is uniformly distributed over $[-\pi, \pi)$ but that the
conditional probability of $\theta$ given $\psi$ is not uniformly distributed over $(-\pi/2,
\pi/2]$. (See Kolmogorov [98, p. 50].)


2. A MORE GENERAL CONDITIONAL EXPECTATION 


Section 1 pointed out that E(Y | X) or P(A | X) depended only on F(X). 
The point was that the relevant information contained in knowing X(w) is 
the information regarding the location of œw. Let (Q, F, P) be a probability 
space, Y a random variable, E |Y| < œ, D any o-field, D © F. 


Definition 4.16. The conditional expectation $E(Y \mid \mathcal{D})$ is any random variable
measurable $(\Omega, \mathcal{D})$ such that

\[ \int_D E(Y \mid \mathcal{D}) \, dP = \int_D Y \, dP, \quad \text{all } D \in \mathcal{D}. \tag{4.17} \]

As before, any two versions differ on a set of probability zero. If $\mathcal{D} = \mathcal{F}(X_1, X_2, \ldots)$, denote $E(Y \mid \mathcal{D}) = E(Y \mid X_1, X_2, \ldots)$.
If $X$ is a random vector to $(R, \mathcal{B})$, then for $x \in R$

Definition 4.18. $E(Y \mid X = x)$ is any random variable on $(R, \mathcal{B}, \hat{P})$, where
$\hat{P}(B) = P(X \in B)$, satisfying

\[ \int_B E(Y \mid X = x) \, \hat{P}(dx) = \int_{\{X \in B\}} Y \, dP, \quad \text{all } B \in \mathcal{B}. \]

The importance of this is mostly computational. By inspection verify that

\[ \varphi(x) = E(Y \mid X = x) \;\Rightarrow\; E(Y \mid X) = \varphi(X). \tag{4.19} \]


Usually, E(Y | X = x) is easier to compute, when densities exist, for example. 
Then (4.19) gets us E(Y | X). 
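
On a finite probability space the passage from $E(Y \mid X = x)$ to $E(Y \mid X)$ can be carried out exactly. The following sketch is an illustration under assumptions that are not in the text: the space (two fair dice), the choices of $X$ and $Y$, and all function names are hypothetical. It computes $\varphi(x) = E(Y \mid X = x)$ by averaging $Y$ over $\{X = x\}$ and then checks the defining relation (4.17) for $\varphi(X)$ on each atom $\{X = x\}$ of $\mathcal{F}(X)$.

```python
from fractions import Fraction

# Hypothetical finite probability space: two fair dice.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
prob = {w: Fraction(1, 36) for w in omega}


def X(w):
    return w[0] + w[1]  # the sum of the two dice


def Y(w):
    return w[0]  # the first die


def cond_exp_given_value(Y, X, prob):
    """phi(x) = E(Y | X = x) for every x with P(X = x) > 0."""
    num, den = {}, {}
    for w, p in prob.items():
        x = X(w)
        num[x] = num.get(x, 0) + Y(w) * p
        den[x] = den.get(x, 0) + p
    return {x: num[x] / den[x] for x in den}


def integral(f, event):
    """Integral of f over the event {w : event(w)} with respect to P."""
    return sum(f(w) * prob[w] for w in omega if event(w))


def satisfies_4_17(phi):
    """Check (4.17) for phi(X) on every atom {X = x} of F(X)."""
    return all(
        integral(lambda w: phi[X(w)], lambda w, x=x: X(w) == x)
        == integral(Y, lambda w, x=x: X(w) == x)
        for x in phi
    )


phi = cond_exp_given_value(Y, X, prob)
```

Since every set in $\mathcal{F}(X)$ is here a finite union of atoms $\{X = x\}$, checking the atoms is enough; by symmetry one expects $\varphi(x) = x/2$ in this example.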


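Proposition 4.20. A list of properties of $E(Y \mid \mathcal{D})$,

1) $E(\alpha Y_1 + \beta Y_2 \mid \mathcal{D}) = \alpha E(Y_1 \mid \mathcal{D}) + \beta E(Y_2 \mid \mathcal{D})$ a.s., assuming $E|Y_1| < \infty$, $E|Y_2| < \infty$.

2) $Y \ge 0$, $EY < \infty$ $\Rightarrow$ $E(Y \mid \mathcal{D}) \ge 0$ a.s.

3) $E|Y| < \infty$, $\mathcal{D} \subset \mathcal{E}$, implies

\[ E(E(Y \mid \mathcal{E}) \mid \mathcal{D}) = E(Y \mid \mathcal{D}) \quad \text{a.s.} \]

4) $\mathcal{F}(Y)$ independent of $\mathcal{D}$, $E|Y| < \infty$ $\Rightarrow$ $E(Y \mid \mathcal{D}) = EY$ a.s.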


Proofs. These proofs follow pretty trivially from the definition 4.16. To
improve technique I'll briefly go through them; the idea in all cases is to
show that the integrals of both sides of (1), (2), (3), (4) over $\mathcal{D}$ sets are the
same (in 2, $\ge$). Let $D \in \mathcal{D}$. Then by 4.16 the integrals over $D$ of the left-hand
sides of (1), (2), (3), (4) above are

1) $\int_D (\alpha Y_1 + \beta Y_2) \, dP$    2) $\int_D Y \, dP$
3) $\int_D E(Y \mid \mathcal{E}) \, dP$    4) $\int_D Y \, dP$.

The right-hand sides integrated over $D$ are

1') $\alpha \int_D Y_1 \, dP + \beta \int_D Y_2 \, dP$    2') $0$
3') $\int_D Y \, dP$    4') $(EY) \cdot P(D)$.

So (1) = (1') is trivial, also (2) $\ge$ (2'). For (4), write

\[ \int_D Y \, dP = \int \chi_D Y \, dP = E\chi_D \cdot EY = P(D) \cdot EY. \]

For (3), (3') note that by 4.16,

\[ \int_F E(Y \mid \mathcal{E}) \, dP = \int_F Y \, dP, \quad \text{all } F \in \mathcal{E}. \]

But $\mathcal{D} \subset \mathcal{E}$, so (3) = (3'), all $D \in \mathcal{D}$.
An important property of the conditional expectation is, if $E|Y| < \infty$,

\[ E(E(Y \mid \mathcal{D})) = EY. \tag{4.21} \]

This follows quickly from the definitions.
Let $Y = \chi_A(\omega)$; then the general definition of conditional probability is

Definition 4.22. Let $\mathcal{D}$ be a sub-$\sigma$-field of $\mathcal{F}$. The conditional probability of
$A \in \mathcal{F}$ given $\mathcal{D}$ is a random variable $P(A \mid \mathcal{D})$ on $(\Omega, \mathcal{D})$ satisfying

\[ \int_D P(A \mid \mathcal{D}) \, dP = P(A \cap D), \quad \text{all } D \in \mathcal{D}. \]
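By the properties in 4.20, a conditional probability acts almost like a probability, that is,

\[ P(A \mid \mathcal{D}) \ge 0 \ \text{a.s.}, \qquad P(\Omega \mid \mathcal{D}) = 1 \ \text{a.s.}, \tag{4.23} \]
\[ A_1, \ldots, A_n \text{ disjoint} \;\Rightarrow\; P\Big(\bigcup A_k \,\Big|\, \mathcal{D}\Big) = \sum_{1}^{n} P(A_k \mid \mathcal{D}) \ \text{a.s.} \]

It is also $\sigma$-additive almost surely. This follows from

Proposition 4.24. Let $Y_n \ge 0$ be random variables such that $Y_n \uparrow Y$ a.s. and
$E|Y| < \infty$. Then $E(Y_n \mid \mathcal{D}) \to E(Y \mid \mathcal{D})$ a.s.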


Proof. Let $Z_n = Y - Y_n$, so $Z_n \downarrow 0$ a.s., and $EZ_n \downarrow 0$. By 4.20(2), $Z_n \ge Z_{n+1}$ $\Rightarrow$ $E(Z_n \mid \mathcal{D}) \ge E(Z_{n+1} \mid \mathcal{D})$ a.s. Therefore the sequence $E(Z_n \mid \mathcal{D})$
converges monotonically downward a.s. to a random variable $U \ge 0$. By
the monotone convergence theorem,

\[ \lim_n E(E(Z_n \mid \mathcal{D})) = EU. \]

Equation (4.21) now gives $EU = \lim EZ_n = 0$. Thus $U = 0$ a.s.
Let $A_k \in \mathcal{F}$, $\{A_k\}$ disjoint, and take

\[ Y_n = \sum_{k=1}^{n} \chi_{A_k} \]

in the above proposition to get from (4.23) to

\[ P\Big(\bigcup A_k \,\Big|\, \mathcal{D}\Big) = \sum_{1}^{\infty} P(A_k \mid \mathcal{D}) \ \text{a.s.} \tag{4.25} \]

For $A$ fixed, $P(A \mid \mathcal{D})$ is an equivalence class of functions $f(A, \omega)$. It seems
reasonable to hope from (4.25) that from each equivalence class a function
$f^*(A, \omega)$ could be selected such that the resulting function $P^*(A \mid \mathcal{D})$ on
$\mathcal{F} \times \Omega$ would be a probability on $\mathcal{F}$ for every $\omega$. If this can be done, then
the entire business of defining $E(Y \mid \mathcal{D})$ would be unnecessary because it
could be defined as

\[ E(Y \mid \mathcal{D}) = \int Y(\omega) \, P^*(d\omega \mid \mathcal{D}). \tag{4.26} \]


Unfortunately, in general it is not possible to do this. What can be done is a 
question which is formulated and partially answered in the next section. 


Problems 


10. If $\mathcal{D}$ has the property $D \in \mathcal{D} \Rightarrow P(D) = 0, 1$, then show $E(Y \mid \mathcal{D}) = EY$ a.s. (if $E|Y| < \infty$).

11. Let $Y$ be a random variable on $(\Omega, \mathcal{F})$, $EY^2 < \infty$. For any random
variable $X$ on $(\Omega, \mathcal{F})$, let $d(Y, X) = E|Y - X|^2$.

a) Prove that among all random variables $X$ on $(\Omega, \mathcal{D})$, $\mathcal{D} \subset \mathcal{F}$, there is an
a.s. unique random variable $Y_0$ which minimizes $d(Y, X)$. This random
variable $Y_0$ is called the best predictor of $Y$ based on $\mathcal{D}$.

b) Prove that $Y_0 = E(Y \mid \mathcal{D})$ a.s.

12. Let $X_0, X_1, X_2, \ldots, X_n$ be random variables having a joint normal
distribution, $EX_k = 0$, $\Gamma_{ij} = E(X_i X_j)$. Show that

\[ E(X_0 \mid X_1, X_2, \ldots, X_n) = \sum_{1}^{n} \lambda_j X_j \ \text{a.s.}, \]

and give the equations determining $\lambda_j$ in terms of the $\Gamma_{ij}$.
(See Chapter 9 for definitions.)

13. Let $X$ be a random vector taking values in $(R, \mathcal{B})$, and $\varphi(x)$ a random
variable on $(R, \mathcal{B})$. Let $Y$ be a random variable on $(\Omega, \mathcal{F})$ and $E|Y| < \infty$,
$E|\varphi(X)| < \infty$, $E|Y\varphi(X)| < \infty$. Prove that

a) $E(\varphi(X)Y \mid X) = \varphi(X) E(Y \mid X)$ a.s. $P$.

b) $E(\varphi(X)Y \mid X = x) = \varphi(x) E(Y \mid X = x)$ a.s. $\hat{P}$.

[This result concurs with the idea that if we know $X$, then given $X$,
$\varphi(X)$ should be treated as a constant. To work Problem 13, a word to the
wise: Start by assuming $\varphi(X) \ge 0$, $Y \ge 0$, consider the class of random
variables $\varphi$ for which it is true, and apply 2.38.]

14. Let $Y$ be a random variable on $(\Omega, \mathcal{F}, P)$ such that $E|Y| < \infty$ and
$X_1, X_2$ random vectors such that $\mathcal{F}(Y, X_1)$ is independent of $\mathcal{F}(X_2)$, then
prove that

\[ E(Y \mid X_1, X_2) = E(Y \mid X_1) \ \text{a.s.} \]

15. Let $X_1, X_2, \ldots$ be independent, identically distributed random variables,
$E|X_1| < \infty$, and denote $S_n = X_1 + \cdots + X_n$. Prove that

\[ E(X_1 \mid S_n, S_{n+1}, \ldots) = S_n / n \ \text{a.s.} \]

[Use symmetry in the final step.]
3. REGULAR CONDITIONAL PROBABILITIES AND DISTRIBUTIONS


Definition 4.27. $P^*(A \mid \mathcal{D})$ will be called a regular conditional probability on
$\mathcal{F}_1 \subset \mathcal{F}$, given $\mathcal{D}$ if

a) For $A \in \mathcal{F}_1$ fixed, $P^*(A \mid \mathcal{D})$ is a version of $P(A \mid \mathcal{D})$.

b) For any $\omega$ held fixed, $P^*(A \mid \mathcal{D})$ is a probability on $\mathcal{F}_1$.

If a regular conditional probability, given D, exists, all the conditional 
expectations can be defined through it. 


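Proposition 4.28. If $P^*(A \mid \mathcal{D})$ is a regular conditional probability on $\mathcal{F}_1$ and
if $Y$ is a random variable on $(\Omega, \mathcal{F}_1)$, $E|Y| < \infty$, then

\[ E(Y \mid \mathcal{D}) = \int Y(\omega) \, P^*(d\omega \mid \mathcal{D}) \ \text{a.s.} \]

Proof. Consider all nonnegative random variables on $(\Omega, \mathcal{F}_1)$ for which 4.28
holds. For the random variable $\chi_A$, $A \in \mathcal{F}_1$, 4.28 becomes

\[ P(A \mid \mathcal{D}) = P^*(A \mid \mathcal{D}) \ \text{a.s.} \]

which holds by 4.27. Hence 4.28 is true for simple functions. Now for
$Y \ge 0$, $EY < \infty$ take $Y_n$ simple $\uparrow Y$. Then by 4.24 and the monotone
convergence theorem,

\[ E(Y \mid \mathcal{D}) = \lim_n E(Y_n \mid \mathcal{D}) = \lim_n \int Y_n(\omega) \, P^*(d\omega \mid \mathcal{D}) = \int Y(\omega) \, P^*(d\omega \mid \mathcal{D}) \ \text{a.s.} \]

Unfortunately, a regular conditional probability on $\mathcal{F}_1$ given $\mathcal{D}$ does not
exist in general (see the Chapter notes). The difficulty is this: by (4.25),
for $A_k \in \mathcal{F}_1$, disjoint, there is a set of probability zero such that

\[ P\Big(\bigcup A_k \,\Big|\, \mathcal{D}\Big) \ne \sum P(A_k \mid \mathcal{D}). \]

If $\mathcal{F}_1$ contains enough countable collections of disjoint sets, then the exceptional sets may pile up.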

By doing something which is like passing over to representation space 
we can get rid of this difficulty. Let Y be a random vector taking values in 
a space (R, B). 

Definition 4.29. $\hat{P}(B \mid \mathcal{D})$ defined for $B \in \mathcal{B}$ and $\omega \in \Omega$ is called a regular
conditional distribution for $Y$ given $\mathcal{D}$ if

i) for $B \in \mathcal{B}$ fixed, $\hat{P}(B \mid \mathcal{D})$ is a version of $P(Y \in B \mid \mathcal{D})$.

ii) for any $\omega \in \Omega$ fixed, $\hat{P}(B \mid \mathcal{D})$ is a probability on $\mathcal{B}$.


If $Y$ is a random variable, then by using the structure of $R^{(1)}$ in an essential
way we prove the following theorem.

Theorem 4.30. There always exists a regular conditional distribution for a
random variable $Y$ given $\mathcal{D}$.


Proof. The general proof we break up into steps. 


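Definition 4.31. $F(x \mid \mathcal{D})$ on $R^{(1)} \times \Omega$ is a conditional distribution function
for $Y$ given $\mathcal{D}$ if

i) $F(x \mid \mathcal{D})$ is a version of $P(Y < x \mid \mathcal{D})$ for every $x$.

ii) for every $\omega$, $F(x \mid \mathcal{D})$ is a distribution function.


Proposition 4.32. There always exists a conditional distribution function for
$Y$ given $\mathcal{D}$.


Proof. Let $R = \{r_j\}$ be the rationals. Select versions of $P(Y < r_j \mid \mathcal{D})$ and
define

\[ M_{ij} = \{\omega;\ P(Y < r_j \mid \mathcal{D}) < P(Y < r_i \mid \mathcal{D})\}, \qquad M = \bigcup_{r_i < r_j} M_{ij}. \]

So $M$ is the set on which monotonicity is violated. By 4.20(2), $P(M) = 0$.
Define

\[ N_i = \Big\{\omega;\ \lim_{r_j \uparrow r_i} P(Y < r_j \mid \mathcal{D}) \ne P(Y < r_i \mid \mathcal{D})\Big\}, \qquad N = \bigcup_i N_i. \]

Since $\chi_{(-\infty, r_j)}(Y) \uparrow \chi_{(-\infty, r_i)}(Y)$, 4.24 implies $P(N_i) = 0$, or $P(N) = 0$. Finally
take $r_j \uparrow \infty$ or $r_j \downarrow -\infty$ and observe that the set $L$ on which $\lim_{r_j} P(Y < r_j \mid \mathcal{D})$
fails to equal one or zero, respectively, has probability zero. Thus, for $\omega$ in the
complement of $M \cup N \cup L$, $P(Y < x \mid \mathcal{D})$ for $x \in R$ is monotone, left-continuous,
zero and one at $-\infty$, $+\infty$, respectively. Take $G(x)$ an arbitrary
distribution function and define:

\[ F(x \mid \mathcal{D}) = \begin{cases} G(x), & \omega \in M \cup N \cup L, \\ \lim_{r_j \uparrow x} P(Y < r_j \mid \mathcal{D}), & \text{otherwise.} \end{cases} \]

It is a routine job to verify that $F(x \mid \mathcal{D})$ defined this way is a distribution
function for every $\omega$. Use 4.24 again and $\chi_{(-\infty, r_j)}(Y) \uparrow \chi_{(-\infty, x)}(Y)$ to check
that $F(x \mid \mathcal{D})$ is a version of $P(Y < x \mid \mathcal{D})$.

Back to Theorem 4.30. For $Y$ a random variable, define $\hat{P}(\cdot \mid \mathcal{D})$, for
each $\omega$, as the probability on $(R^{(1)}, \mathcal{B}_1)$ extended from the distribution
function $F(x \mid \mathcal{D})$. Let $\mathcal{C}$ be the class of all sets $C \in \mathcal{B}_1$ such that $\hat{P}(C \mid \mathcal{D})$
is a version of $P(Y \in C \mid \mathcal{D})$. By 4.23, $\mathcal{C}$ contains all finite disjoint unions of
left-closed right-open intervals. By 4.24, $\mathcal{C}$ is closed under monotone limits.
Hence $\mathcal{C} = \mathcal{B}_1$. Therefore, $\hat{P}(\cdot \mid \mathcal{D})$ is the required conditional distribution.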
For random vectors $Y$ taking values in $(R, \mathcal{B})$, this result can be extended
if $(R, \mathcal{B})$ looks enough like $(R^{(1)}, \mathcal{B}_1)$.


Definition 4.33. Call $(R, \mathcal{B})$ a Borel space if there is an $E \in \mathcal{B}_1$ and a one-to-one
mapping $\varphi: R \leftrightarrow E$ such that $\varphi$ is $\mathcal{B}$-measurable and $\varphi^{-1}$ is $\mathcal{B}_1$-measurable.


Borel spaces include almost all the useful probability spaces. For
example, see the proof in Appendix A that $(R^{(\infty)}, \mathcal{B}_\infty)$ is a Borel space. So,
more easily, is $(R^{(n)}, \mathcal{B}_n)$, $n \ge 1$.


Theorem 4.34. If $Y$ takes values in a Borel space $(R, \mathcal{B})$, then there is a regular
conditional distribution for $Y$ given $\mathcal{D}$.


Proof. By definition, there is a one-to-one mapping $\varphi: R \leftrightarrow E \in \mathcal{B}_1$ with
$\varphi$, $\varphi^{-1}$ measurable $\mathcal{B}$, $\mathcal{B}_1$ respectively. Take $Y_0 = \varphi(Y)$; $Y_0$ is a random
variable so there is a regular conditional distribution $\hat{P}_0(A \mid \mathcal{D}) = P(Y_0 \in A \mid \mathcal{D})$
a.s. Define $\hat{P}(B \mid \mathcal{D})$, for $B \in \mathcal{B}$, by

\[ \hat{P}(B \mid \mathcal{D}) = \hat{P}_0(\varphi(B) \mid \mathcal{D}). \]

Because $\varphi(B)$ is the inverse set mapping of the measurable mapping $\varphi^{-1}(x)$,
$\hat{P}(\cdot \mid \mathcal{D})$ is a probability on $\mathcal{B}$ for each $\omega$, and is also a version of $P(Y \in B \mid \mathcal{D})$
for every $B \in \mathcal{B}$.


Since the distribution of processes is determined on their range space, a 
regular conditional distribution will suit us just as well as a regular conditional 
probability. For instance, 


Proposition 4.35. Let $Y$ be a random vector taking values in $(R, \mathcal{B})$, $y$ a point in
$R$, $\varphi$ any $\mathcal{B}$-measurable function such that $E|\varphi(Y)| < \infty$. Then if $\hat{P}(\cdot \mid \mathcal{D})$ is a
regular conditional distribution for $Y$ given $\mathcal{D}$,

\[ E(\varphi(Y) \mid \mathcal{D}) = \int \varphi(y) \, \hat{P}(dy \mid \mathcal{D}) \ \text{a.s.} \]

Proof. Same as 4.28 above.


If $Y$, $X$ are two random vectors taking values in $(R, \mathcal{B})$, $(S, \mathcal{S})$ respectively,
then define a regular conditional distribution for $Y$ given $X = x$, $\hat{P}(B \mid X = x)$,
in the analogous way: for each $x \in S$, it is a probability on $\mathcal{B}$, and for $B$
fixed it is a version of $P(Y \in B \mid X = x)$. Evidently the results 4.34 and 4.35
hold concerning $\hat{P}(B \mid X = x)$. Some further useful results are:


Proposition 4.36. Let $\varphi(x, y)$ be a random variable on the product space
$(R, \mathcal{B}) \times (S, \mathcal{S})$, $E|\varphi(X, Y)| < \infty$. If $\hat{P}(\cdot \mid X = x)$ is a regular conditional
distribution for $Y$ given $X = x$, then,

\[ E(\varphi(X, Y) \mid X = x) = \int \varphi(x, y) \, \hat{P}(dy \mid X = x) \ \text{a.s.} \tag{4.37} \]

[Note: The point of 4.36 is that $x$ is held constant in the integration occurring
in the right-hand side of (4.37). Since $\varphi(x, y)$ is jointly measurable in $(x, y)$,
for fixed $x$ it is a measurable function of $y$.]


Proof. Let $\varphi(x, y) = \chi_C(x)\chi_D(y)$, $C$, $D$ measurable sets; then, by Problem
13(b), a.s. $\hat{P}$,

\[ E(\chi_C(X)\chi_D(Y) \mid X = x) = \chi_C(x) E(\chi_D(Y) \mid X = x) = \chi_C(x) P(Y \in D \mid X = x). \]

On the right in 4.37 is

\[ \chi_C(x) \int \chi_D(y) \, \hat{P}(dy \mid X = x) = \chi_C(x) \hat{P}(D \mid X = x), \]

which, by definition of $\hat{P}$, verifies (4.37) for this $\varphi$.

Now to finish the proof, just approximate in the usual way; that is,
(4.37) is now true for all $\varphi(x, y)$ of the form $\sum_j \chi_{C_j}(x)\chi_{D_j}(y)$, $C_j$, $D_j$ measurable
sets, and apply now the usual monotonicity argument.


One can see here the usefulness of a regular conditional distribution.
It is tempting to replace the right-hand side of (4.37) by $E(\varphi(x, Y) \mid X = x)$.
But this object cannot be defined through the standard definition of conditional expectation (4.18).


A useful corollary is 


Corollary 4.38. Let $X$, $Y$ be independent random vectors. For $\varphi(x, y)$ as in
4.36,

\[ E(\varphi(X, Y) \mid X = x) = E\varphi(x, Y) \ \text{a.s.} \]


Proof. A regular conditional distribution for Y given X = x is P(Y € B). 
Apply this in 4.36. 
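
For discrete vectors the corollary can be checked by direct computation. The sketch below is only an illustration: the two laws `px`, `py` and the function `phi` are made-up choices, not taken from the text. The left side is computed from the joint law as a ratio, the right side by freezing $x$ and integrating $\varphi(x, \cdot)$ against the law of $Y$.

```python
from fractions import Fraction

# Assumed (hypothetical) laws of two independent discrete variables.
px = {0: Fraction(1, 4), 1: Fraction(1, 2), 3: Fraction(1, 4)}  # law of X
py = {-1: Fraction(1, 3), 2: Fraction(2, 3)}                    # law of Y


def phi(x, y):
    return x * y + y * y


def lhs(x):
    """E(phi(X, Y) | X = x), from the joint law P(X=a, Y=b) = px[a] * py[b]."""
    joint = {(a, b): px[a] * py[b] for a in px for b in py}
    num = sum(phi(a, b) * p for (a, b), p in joint.items() if a == x)
    den = sum(p for (a, b), p in joint.items() if a == x)
    return num / den


def rhs(x):
    """E phi(x, Y): x is held constant and only Y is integrated out."""
    return sum(phi(x, b) * q for b, q in py.items())
```

Without independence the joint weights would not factor, and `lhs` and `rhs` would in general disagree.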


Problems 
16. Let $I$ be any interval in $R^{(1)}$. A function $\varphi(x)$ on $I$ measurable $\mathcal{B}_1(I)$
is called convex if for all $t \in [0, 1]$ and $x \in I$, $y \in I$

\[ \varphi(tx + (1 - t)y) \le t\varphi(x) + (1 - t)\varphi(y). \]

Prove that if $Y$ is a random variable with range in $I$, and $E|Y| < \infty$, then
for $\varphi(x)$ convex on $I$, and $E|\varphi(Y)| < \infty$,

a) $\varphi(EY) \le E[\varphi(Y)]$ (Jensen's inequality),

b) $\varphi(E(Y \mid \mathcal{D})) \le E(\varphi(Y) \mid \mathcal{D})$ a.s.

[On (b) use the existence of a regular conditional distribution.]

17. Let $X$, $Y$ be independent random vectors.

a) Show that a regular conditional probability for $Y$ given $X = x$ is
$P(Y \in B)$.

b) If $\varphi(x, y)$ is a random variable on the product of the range spaces, show
that a regular conditional distribution for $Z = \varphi(X, Y)$ given $X = x$
is $P(\varphi(x, Y) \in B)$.


NOTES 


The modern definition of conditional probabilities and expectations, using 
the Radon-Nikodym theorem, was formulated by Kolmogorov in his mono- 
graph of 1933 [98]. 

The difficulty in getting a regular conditional probability on $\mathcal{F}(X)$,
given any $\sigma$-field $\mathcal{D}$, is similar to the extension problem. Once $\hat{P}(\cdot \mid \mathcal{D})$ is
gotten on the range space $(R, \mathcal{B})$ of $X$, if for any set $B \in \mathcal{B}$ contained in the
complement of the range of $X$, $\hat{P}(B \mid \mathcal{D}) = 0$, then by 2.12 we can get a
regular conditional probability $P^*(\cdot \mid \mathcal{D})$ on $\mathcal{F}(X)$ given $\mathcal{D}$. In this case it is
sufficient that the range of $X$ be a set in $\mathcal{B}$. Blackwell's article [8] also deals
with this problem.

The counterexample referred to in the text is this: Let $\Omega = [0, 1]$, $\mathcal{D} = \mathcal{B}_1([0, 1])$.
Take $C$ to be a nonmeasurable set of outer Lebesgue measure one
and inner Lebesgue measure zero. The smallest $\sigma$-field $\mathcal{F}$ containing $\mathcal{D}$
and $C$ consists of all sets of the form $A = (C \cap B_1) \cup (C^c \cap B_2)$, where
$B_1$, $B_2$ are in $\mathcal{D}$. Define $P$ on $\mathcal{F}$ by: If $A$ has the above form,

\[ P(A) = \tfrac{1}{2} l(B_1) + \tfrac{1}{2} l(B_2). \]

Because the $B_1$, $B_2$ in the definition of $A$ are not unique, it is necessary to
check that $P$ is well defined, as well as a probability. There does not exist
any regular conditional probability on $\mathcal{F}$ given $\mathcal{D}$. This example can be
found in Doob [39, p. 624].


CHAPTER 5 


MARTINGALES 


1. GAMBLING AND GAMBLING SYSTEMS 


Since probability theory started from a desire to win at gambling, it is only 
sporting to discuss some examples from this area. 


Example 1. Let $Z_1, Z_2, \ldots$ be independent, $Z_i = +1, -1$ with probability
$p$, $1 - p$. Suppose that at the $n$th toss, we bet the amount $b$. Then we receive
the amount $b$ if $Z_n = 1$, otherwise we lose the amount $b$. A gambling
strategy is a rule which tells us how much to bet on the $(n + 1)$st toss. To be
interesting, the rule will depend on the first $n$ outcomes. In general, then, a
gambling strategy is a sequence of functions $b_n: \{-1, +1\}^{(n)} \to [0, \infty)$ such
that $b_n(Z_1, \ldots, Z_n)$ is the amount we bet on the $(n + 1)$st game. If we start
with initial fortune $S_0$, then our fortune after $n$ plays, $S_n$, is a random variable
defined by

\[ S_{n+1} = S_n + Z_{n+1} b_n(Z_1, \ldots, Z_n). \tag{5.1} \]

We may wish to further restrict the $b_n$ by insisting that $b_n(Z_1, \ldots, Z_n) \le S_n$.
Define the time of ruin $n^*$ as the first $n$ such that $S_n = 0$. One question that
is certainly interesting is: What is $P(n^* < \infty)$; equivalently, what is the
probability of eventual ruin?

There is one property of the sequence of fortunes given by (5.1) that I 
want to focus on. 


Proposition 5.2. For $p = \tfrac12$,

\[ E(S_{n+1} \mid S_n, S_{n-1}, \ldots, S_1) = S_n \ \text{a.s.}, \quad n = 1, 2, \ldots \]

For $p < \tfrac12$,

\[ E(S_{n+1} \mid S_n, \ldots, S_1) \le S_n \ \text{a.s.}, \quad n = 1, 2, \ldots \]


Proof.

\[ E(S_{n+1} \mid Z_n, \ldots, Z_1) = E(S_n + Z_{n+1} b_n(Z_1, \ldots, Z_n) \mid Z_n, \ldots, Z_1) = S_n + b_n(Z_1, \ldots, Z_n) E(Z_{n+1}). \]

The last step follows because $S_n$ is a function of $Z_1, \ldots, Z_n$. If $p = \tfrac12$,
then $EZ_{n+1} = 0$, otherwise $EZ_{n+1} < 0$, but $b_n$ is nonnegative. Thus, for
$p = \tfrac12$, $E(S_{n+1} \mid Z_n, \ldots, Z_1) = S_n$ a.s. For $p < \tfrac12$, $E(S_{n+1} \mid Z_n, \ldots, Z_1) \le S_n$ a.s.
Note that $\mathcal{F}(S_1, \ldots, S_n) \subset \mathcal{F}(Z_1, \ldots, Z_n)$. Taking conditional
expectations of both sides above with respect to $S_1, \ldots, S_n$ concludes the
proof.
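
Proposition 5.2 can be checked exactly for short games by enumerating every outcome sequence. The sketch below is an illustration only: the betting rule `bet`, the initial fortune 4, and all function names are hypothetical choices, not the author's. It follows (5.1), groups outcomes by the fortune path $(S_1, \ldots, S_n)$, and returns $E(S_{n+1} \mid S_1, \ldots, S_n) - S_n$ on each group.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def bet(history, fortune):
    """Hypothetical rule b_n: stake 1 at the start or after a win, else half the fortune."""
    if not history or history[-1] == 1:
        return min(Fraction(1), fortune)
    return fortune / 2


def fortunes(z, s0):
    """Fortune path (S_1, ..., S_n) for outcomes z = (Z_1, ..., Z_n), following (5.1)."""
    path, s = [], s0
    for k, zk in enumerate(z):
        s = s + zk * bet(z[:k], s)
        path.append(s)
    return tuple(path)


def conditional_drift(n, p, s0=Fraction(4)):
    """Map each fortune path (S_1..S_n) to E(S_{n+1} | S_1..S_n) - S_n."""
    num, den = defaultdict(Fraction), defaultdict(Fraction)
    for z in product((1, -1), repeat=n + 1):
        weight = Fraction(1)
        for zk in z:
            weight *= p if zk == 1 else 1 - p
        path = fortunes(z, s0)
        key = path[:n]
        num[key] += weight * (path[n] - path[n - 1])
        den[key] += weight
    return {key: num[key] / den[key] for key in den}
```

With `p = Fraction(1, 2)` every drift is zero, and with `p = Fraction(1, 3)` every drift is nonpositive, whatever nonnegative rule is substituted for `bet`.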


Example 2. To generalize Example 1, consider any process $Z_1, Z_2, \ldots$
With any system of gambling on the successive outcomes of the $Z_1, Z_2, \ldots$
the player's fortune after $n$ plays, $S_n$, is a function of $Z_1, \ldots, Z_n$. We assume
that corresponding to any gambling system is a sequence of real-valued
functions $\varphi_n$, measurable $\mathcal{B}_n$, such that the fortune $S_n$ is given by
$\varphi_n(Z_1, \ldots, Z_n)$ and $E|S_n| < \infty$.

Definition 5.3. The sequence of games $Z_1, Z_2, \ldots$, under the given gambling
system is called fair if

\[ E(S_{n+1} \mid S_n, \ldots, S_1) = S_n \ \text{a.s.}, \quad n = 1, 2, \ldots, \]

unfavorable if

\[ E(S_{n+1} \mid S_n, \ldots, S_1) \le S_n \ \text{a.s.}, \quad n = 1, 2, \ldots \]


The first function of probability in gambling was involved in computation
of odds. We wish to address ourselves to more difficult questions such as:
Is there any system of gambling in fair or unfavorable games that will yield
$S_n \to \infty$ a.s.? When can we assert that $P(n^* < \infty) = 1$ for a large class of
gambling systems? This class of problems is a natural gateway to a general
study of processes behaving like the sequences $S_n$.


2. DEFINITIONS OF MARTINGALES AND SUBMARTINGALES 


Definition 5.4. Given a process $X_1, X_2, \ldots$ It will be said to form a martingale
(MG) if $E|X_k| < \infty$, $k = 1, 2, \ldots$ and

\[ E(X_{n+1} \mid X_n, \ldots, X_1) = X_n \ \text{a.s.}, \quad n = 1, 2, \ldots \tag{5.5} \]

It forms a submartingale (SMG) if $E|X_k| < \infty$, $k = 1, 2, \ldots$, and

\[ E(X_{n+1} \mid X_n, \ldots, X_1) \ge X_n \ \text{a.s.}, \quad n = 1, 2, \ldots \tag{5.6} \]


Example 3. Sums of independent random variables. Let $Y_1, Y_2, \ldots$ be a
sequence of independent random variables such that $E|Y_k| < \infty$, all $k$,
$EY_k = 0$, $k = 1, 2, \ldots$ Define $X_n = Y_1 + \cdots + Y_n$; then $E|X_n| \le \sum_1^n E|Y_k| < \infty$.
Furthermore, $\mathcal{F}(X_1, \ldots, X_n) = \mathcal{F}(Y_1, \ldots, Y_n)$, so

\[ E(X_{n+1} \mid X_n, \ldots, X_1) = E(X_{n+1} \mid Y_n, \ldots, Y_1) = E(Y_{n+1} \mid Y_n, \ldots, Y_1) + X_n = X_n. \]

As mentioned in connection with fortunes, if the appropriate inequalities
hold with respect to larger $\sigma$-fields, then the same inequalities hold for the
process, that is,
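Proposition 5.7. Let $X_1, X_2, \ldots$ be a process with $E|X_k| < \infty$, $k = 1, 2, \ldots$
Let $Y_1, Y_2, \ldots$ be another process such that $\mathcal{F}(X_1, \ldots, X_n) \subset \mathcal{F}(Y_1, \ldots, Y_n)$,
$n = 1, 2, \ldots$ If

\[ E(X_{n+1} \mid Y_n, \ldots, Y_1) \underset{(=)}{\ge} X_n \ \text{a.s.}, \]

then $X_1, X_2, \ldots$ is a SMG(MG).

Proof. Follows from

\[ E(E(X_{n+1} \mid Y_n, \ldots, Y_1) \mid X_n, \ldots, X_1) = E(X_{n+1} \mid X_n, \ldots, X_1). \]

[See 4.20 (3).]

For any $m > n$, if $X_1, \ldots$ is a SMG(MG), then

\[ E(X_m \mid X_n, \ldots, X_1) \underset{(=)}{\ge} X_n \ \text{a.s.}, \]

because, for example,

\[ E(X_{n+2} \mid X_n, \ldots, X_1) = E(E(X_{n+2} \mid X_{n+1}, X_n, \ldots, X_1) \mid X_n, \ldots, X_1) \ge E(X_{n+1} \mid X_n, \ldots, X_1) \ge X_n \ \text{a.s.} \]

This remark gives us an equivalent way of defining SMG or MG.


Proposition 5.8. Let $X_1, X_2, \ldots$ be a process with $E|X_k| < \infty$, all $k$. Then it
is a SMG(MG) iff for every $m > n$ and $A \in \mathcal{F}(X_1, \ldots, X_n)$

\[ \int_A X_m \, dP \underset{(=)}{\ge} \int_A X_n \, dP. \]

Proof. By definition, $\int_A X_m \, dP = \int_A E(X_m \mid X_n, \ldots, X_1) \, dP$.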


Problem 1. Let $X_1, X_2, \ldots$ be a MG or SMG; let $I$ be an interval
containing the range of $X_n$, $n = 1, 2, \ldots$, and let $\varphi(x)$ be convex on $I$.
Prove

a) If $E|\varphi(X_n)| < \infty$, $n = 1, 2, \ldots$ and $X_1, X_2, \ldots$ is a MG, then $X'_n = \varphi(X_n)$ is a SMG.

b) If $E|\varphi(X_n)| < \infty$, $n = 1, \ldots$, $\varphi(x)$ also nondecreasing on $I$, and $X_1, X_2, \ldots$
a SMG, then $X'_n = \varphi(X_n)$ is a SMG.


3. THE OPTIONAL SAMPLING THEOREM 


One of the most powerful theorems concerning martingales is built around 
the idea of transforming a process by optional sampling. Roughly, the idea 
is this: starting with a given process $X_1, X_2, \ldots$, decide on the basis of
observing $X_1, \ldots, X_j$ whether or not $X_j$ will be the first value of the transformed
process; then keep observing until on the basis of your observations
a second value of the transformed process is chosen, and so forth. More
precisely,


Definition 5.9. Let $X_1, X_2, \ldots$ be an arbitrary process. A sequence $m_1, m_2, \ldots$
of integer-valued random variables will be called sampling variables if

a) $1 \le m_1 \le m_2 \le \cdots$

b) $\{m_n = j\} \in \mathcal{F}(X_1, \ldots, X_j)$.

Then the sequence of random variables $\tilde{X}_n$ defined by $\tilde{X}_n = X_{m_n}$ is called the
process derived by optional sampling from the original process.


Theorem 5.10. Let $X_1, X_2, \ldots$ be a SMG (MG), $m_1, m_2, \ldots$ sampling variables,
and $\tilde{X}_n$ the optional sampling process derived from $X_1, X_2, \ldots$ If

a) $E|\tilde{X}_n| < \infty$, all $n$,

b) $\liminf_N \int_{\{m_n > N\}} |X_N| \, dP = 0$, all $n$,

then the $\tilde{X}_1, \tilde{X}_2, \ldots$ process is a SMG(MG).
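Proof. Let $A \in \mathcal{F}(\tilde{X}_1, \ldots, \tilde{X}_n)$. We must show that

\[ \int_A \tilde{X}_{n+1} \, dP \underset{(=)}{\ge} \int_A \tilde{X}_n \, dP. \]

Let $D_j = A \cap \{m_n = j\}$. Then $A = \bigcup_j D_j$ and it suffices to show that

\[ \int_{D_j} \tilde{X}_{n+1} \, dP \underset{(=)}{\ge} \int_{D_j} \tilde{X}_n \, dP, \quad \text{all } j. \]

We assert that $D_j \in \mathcal{F}(X_1, \ldots, X_j)$, since, letting $A = \{(\tilde{X}_1, \ldots, \tilde{X}_n) \in B\}$,
$B \in \mathcal{B}_n$, then

\[ D_j = \{(\tilde{X}_1, \ldots, \tilde{X}_n) \in B,\ m_n = j\} \]
\[ = \bigcup_{j_1 \le \cdots \le j_{n-1} \le j} \{(\tilde{X}_1, \ldots, \tilde{X}_n) \in B,\ m_1 = j_1, \ldots, m_{n-1} = j_{n-1},\ m_n = j\} \]
\[ = \bigcup_{j_1 \le \cdots \le j_{n-1} \le j} \{(X_{j_1}, \ldots, X_{j_{n-1}}, X_j) \in B,\ m_1 = j_1, \ldots, m_n = j\}. \]

Evidently, each set in this union is in $\mathcal{F}(X_1, \ldots, X_j)$. Of course,

\[ \int_{D_j} \tilde{X}_n \, dP = \int_{D_j} X_j \, dP. \]

Now, for arbitrary $N > j$,

\[ \int_{D_j} \tilde{X}_{n+1} \, dP = \sum_{i=j}^{N} \int_{D_j \cap \{m_{n+1} = i\}} \tilde{X}_{n+1} \, dP + \int_{D_j \cap \{m_{n+1} > N\}} \tilde{X}_{n+1} \, dP \]
\[ = \sum_{i=j}^{N} \int_{D_j \cap \{m_{n+1} = i\}} X_i \, dP + \int_{D_j \cap \{m_{n+1} > N\}} X_N \, dP - \int_{D_j \cap \{m_{n+1} > N\}} (X_N - \tilde{X}_{n+1}) \, dP. \]

The first two terms on the right in the above telescope. Starting with the last
term of the sum,

\[ \int_{D_j \cap \{m_{n+1} = N\}} X_N \, dP + \int_{D_j \cap \{m_{n+1} > N\}} X_N \, dP = \int_{D_j \cap \{m_{n+1} \ge N\}} X_N \, dP. \]

But $\{m_{n+1} \ge N\} = \{m_{n+1} < N\}^c \in \mathcal{F}(X_1, \ldots, X_{N-1})$. By 5.8,

\[ \int_{D_j \cap \{m_{n+1} \ge N\}} X_N \, dP \underset{(=)}{\ge} \int_{D_j \cap \{m_{n+1} \ge N\}} X_{N-1} \, dP = \int_{D_j \cap \{m_{n+1} > N-1\}} X_{N-1} \, dP. \]

Hence the first two terms reduce to being greater than or equal to

\[ \int_{D_j \cap \{m_{n+1} \ge j\}} X_j \, dP. \]

But $D_j \subset \{m_{n+1} \ge j\}$, so we conclude that

\[ \int_{D_j} \tilde{X}_{n+1} \, dP \underset{(=)}{\ge} \int_{D_j} X_j \, dP - \int_{D_j \cap \{m_{n+1} > N\}} (X_N - \tilde{X}_{n+1}) \, dP. \]

Letting $N \to \infty$ through an appropriate subsequence, by (b) we can force

\[ \int_{D_j \cap \{m_{n+1} > N\}} X_N \, dP \to 0. \]

But $\{m_{n+1} > N\} \downarrow \emptyset$, and the bounded convergence theorem gives

\[ \int_{D_j \cap \{m_{n+1} > N\}} \tilde{X}_{n+1} \, dP \to 0, \]

since $E|\tilde{X}_{n+1}| < \infty$ by (a). Q.E.D.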

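Corollary 5.11. Assume the conditions of Theorem 5.10 are in force, and that
in addition, $\lim E|X_n| < \infty$, then

1) $EX_1 \le E\tilde{X}_n \le \lim EX_n$,

2) $E|\tilde{X}_n| \le 2 \lim E|X_n| - EX_1$.

Proof. Without loss of generality, take $m_1 = 1$. By 5.10,

\[ E(\tilde{X}_n \mid \tilde{X}_1) \ge \tilde{X}_1 \;\Rightarrow\; E\tilde{X}_n \ge E\tilde{X}_1 = EX_1. \]

Also,

\[ E\tilde{X}_n = \lim_N \sum_{j=1}^{N} \int_{\{m_n = j\}} X_j \, dP, \]

but

\[ \sum_{j=1}^{N} \int_{\{m_n = j\}} X_j \, dP \le \sum_{j=1}^{N} \int_{\{m_n = j\}} X_N \, dP = EX_N - \int_{\{m_n > N\}} X_N \, dP. \]

By condition (b), we may take a subsequence of $N$ such that

\[ \int_{\{m_n > N\}} X_N \, dP \to 0. \]

Using this subsequence, the conclusion $E\tilde{X}_n \le \lim EX_n$ follows. For
part (2), note that if $X_n$ is a SMG, so is $X_n^+ = \max(0, X_n)$. Thus by 5.10
so is $\tilde{X}_n^+$. Applying (1) we have

\[ E\tilde{X}_n^+ \le \lim EX_n^+. \]

For the original process,

\[ E\tilde{X}_n^+ - E\tilde{X}_n^- \ge EX_1 \quad \text{or} \quad E\tilde{X}_n^- \le E\tilde{X}_n^+ - EX_1, \]
\[ E|\tilde{X}_n| = E\tilde{X}_n^+ + E\tilde{X}_n^- \le 2 \lim EX_n^+ - EX_1 \le 2 \lim E|X_n| - EX_1. \]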

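Proposition 5.12. Under hypothesis (b) of Theorem 5.10, if $\lim E|X_n| < \infty$,
then $\sup_n E|\tilde{X}_n| < \infty$, so that (a) holds and the theorem is in force.


Proof. We introduce truncated sampling variables: for $M > 0$,

\[ m_{n,M} = \begin{cases} m_n & \text{if } m_n \le M, \\ M & \text{if } m_n > M, \end{cases} \]

and $\tilde{X}_{n,M} = X_{m_{n,M}}$. The reason for this is

\[ E|\tilde{X}_{n,M}| = \sum_{j=1}^{M} \int_{\{m_{n,M} = j\}} |X_j| \, dP \le \sum_{j=1}^{M} E|X_j| < \infty, \]

so that the conditions of Theorem 5.10 are in force for $\tilde{X}_{n,M}$. By Corollary
5.11, $E|\tilde{X}_{n,M}| \le \alpha < \infty$, where $\alpha$ does not depend on $n$ or $M$. Note now
that $\lim_M m_{n,M} = m_n$; hence $\lim_M \tilde{X}_{n,M} = \tilde{X}_n$. By the Fatou lemma
(Appendix A.27),

\[ \int \liminf_M |\tilde{X}_{n,M}| \, dP \le \liminf_M E|\tilde{X}_{n,M}| \le \alpha \]

or

\[ E|\tilde{X}_n| \le \alpha, \quad \text{all } n. \]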
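One nice application of optional sampling is


Proposition 5.13. If $X_1, X_2, \ldots$ is a SMG then for any $x > 0$

1) $P\big(\max_{1 \le j \le k} X_j \ge x\big) \le \frac{1}{x} E|X_k|$,

2) $P\big(\min_{1 \le j \le k} X_j \le -x\big) \le \frac{1}{x} (E|X_k| - EX_1)$.


Proof. 1) Define sampling variables by

\[ m_1 = \begin{cases} \text{first } j \le k \text{ such that } X_j \ge x, & \text{or} \\ k & \text{if no such } j \text{ exists,} \end{cases} \qquad m_n = k, \quad n \ge 2. \]

The conditions of 5.10 are satisfied, since $\{m_n > N\} = \emptyset$, $N \ge k$, and
$E|\tilde{X}_n| \le \sum_1^k E|X_j| < \infty$. Now

\[ P\big(\max_{1 \le j \le k} X_j \ge x\big) = P(X_{m_1} \ge x) \le \frac{1}{x} \int_{\{X_{m_1} \ge x\}} X_{m_1} \, dP. \]

By the optional sampling theorem,

\[ \int_{\{X_{m_1} \ge x\}} X_{m_1} \, dP \le \int_{\{X_{m_1} \ge x\}} X_k \, dP \le E|X_k|. \]

2) To go below, take $x > 0$ and define

\[ m_1 = 1, \qquad m_2 = \begin{cases} \text{first } j \le k \text{ such that } X_j \le -x, & \text{or} \\ k & \text{if no such } j \text{ exists,} \end{cases} \qquad m_n = k, \quad n \ge 3. \]

The conditions of 5.10 are again satisfied, so

\[ EX_{m_2} \ge EX_{m_1} = EX_1. \]

But

\[ EX_{m_2} = \int_{\{X_{m_2} > -x\}} X_{m_2} \, dP + \int_{\{X_{m_2} \le -x\}} X_{m_2} \, dP \le E|X_k| - xP(X_{m_2} \le -x). \]

Therefore, since

\[ P\big(\min_{1 \le j \le k} X_j \le -x\big) = P(X_{m_2} \le -x), \]

part (2) follows.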
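
Both inequalities of 5.13 can be verified exactly on a small example. The sketch below is an illustration only: it takes the fair $\pm 1$ random walk $S_j$ (a martingale by Example 3, hence a SMG) as the process, enumerates all $2^k$ paths, and compares the two probabilities with the stated bounds. The function name and the choice of process are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product


def maximal_inequality_check(k, x):
    """Exact values for a fair +-1 walk S_1..S_k:
    (P(max S_j >= x), E|S_k|/x, P(min S_j <= -x), (E|S_k| - E S_1)/x)."""
    w = Fraction(1, 2 ** k)  # probability of each path
    p_max = p_min = e_abs = Fraction(0)
    for steps in product((1, -1), repeat=k):
        path, s = [], 0
        for z in steps:
            s += z
            path.append(s)
        if max(path) >= x:
            p_max += w
        if min(path) <= -x:
            p_min += w
        e_abs += w * abs(path[-1])
    e_first = Fraction(0)  # E S_1 = 0 for the fair walk
    return p_max, e_abs / x, p_min, (e_abs - e_first) / x
```

For instance, `maximal_inequality_check(8, 2)` returns the two exact probabilities next to their bounds; the bounds are far from sharp here, which is typical for small $x$.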
Problem 2. Use 5.13 and Problem 1 to prove Kolmogorov's extension of the
Chebyshev inequality (see the notes to Chapter 3). That is, if $X_1, X_2, \ldots$
is a MG and $EX_n^2 < \infty$, then show

\[ P\big(\max_{1 \le k \le n} |X_k| > \varepsilon\big) \le \frac{1}{\varepsilon^2} EX_n^2. \]

4. THE MARTINGALE CONVERGENCE THEOREM 
One of the outstanding strong convergence theorems is 


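Theorem 5.14. Let $X_1, X_2, \ldots$ be a SMG such that $\limsup E|X_n| < \infty$, then there
exists a random variable $X$ such that $X_n \to X$ a.s. and $E|X| < \infty$.


Proof. To prove this, we are going to define sampling variables. Let $b > a$
be any two real numbers. Let

\[ m_1^* = \begin{cases} \infty & \text{if } X_n > a, \text{ all } n, \\ \text{first } n \text{ such that } X_n \le a, & \text{otherwise;} \end{cases} \]
\[ m_2^* = \begin{cases} \infty & \text{if } X_n < b, \text{ all } n > m_1^*, \\ \text{first } n > m_1^* \text{ such that } X_n \ge b, & \text{otherwise.} \end{cases} \]

In general,

\[ m_{2k+1}^* = \begin{cases} \infty & \text{if } X_n > a, \text{ all } n > m_{2k}^*, \\ \text{first } n > m_{2k}^* \text{ such that } X_n \le a, & \text{otherwise;} \end{cases} \]
\[ m_{2k+2}^* = \begin{cases} \infty & \text{if } X_n < b, \text{ all } n > m_{2k+1}^*, \\ \text{first } n > m_{2k+1}^* \text{ such that } X_n \ge b, & \text{otherwise.} \end{cases} \]

That is, the $m_n^*$ are the successive times that $X_n$ drops below $a$ or rises above
$b$. Now, what we would like to conclude for any $b > a$ is that

\[ P(X_n \le a \ \text{i.o.},\ X_n \ge b \ \text{i.o.}) = 0. \]

This is equivalent to proving that if

\[ S = \{m_n^* < \infty, \text{ all } n\}, \]

that $P(S) = 0$. Suppose we defined $\tilde{X}_n = X_{m_n^*}$, and

\[ Z = (\tilde{X}_3 - \tilde{X}_2) + (\tilde{X}_5 - \tilde{X}_4) + \cdots \]

On $S$, $\tilde{X}_{2n+1} - \tilde{X}_{2n} \le -(b - a)$, so $Z = -\infty$, but if $\tilde{X}_n$ is a SMG,
$E(\tilde{X}_{2n+1} - \tilde{X}_{2n}) \ge 0$, so $EZ \ge 0$, giving a contradiction which would be
resolved only by $P(S) = 0$. To make this idea firm, define

\[ m_{n,M} = \min(M, m_n^*), \qquad \tilde{X}_{n,M} = X_{m_{n,M}}. \]

Then the optional sampling theorem is in force, and $\tilde{X}_{n,M}$ is a SMG. Define

\[ Z_M = (\tilde{X}_{3,M} - \tilde{X}_{2,M}) + (\tilde{X}_{5,M} - \tilde{X}_{4,M}) + \cdots, \]

noting that $\tilde{X}_{2n+1,M} - \tilde{X}_{2n,M} = 0$ for $m_{2n}^* > M$, which certainly holds for
$2n > M$.
Let $\beta_M$ be the largest $n$ such that $m_{2n}^* \le M$. On the set $\{\beta_M = k\}$,

\[ Z_M = (\tilde{X}_{3,M} - \tilde{X}_{2,M}) + \cdots + (\tilde{X}_{2k+1,M} - \tilde{X}_{2k,M}) \]
\[ = (\tilde{X}_{3,M} - \tilde{X}_{2,M}) + \cdots + (a - \tilde{X}_{2k,M}) + (\tilde{X}_{2k+1,M} - a). \]

The last term above can be positive only if $m_{2k+1}^* > M$, in which case it
becomes $(X_M - a)$. In any case, on $\{\beta_M = k\}$

\[ Z_M \le -k(b - a) + (X_M - a)^+, \]

or, in general,

\[ Z_M \le -\beta_M (b - a) + (X_M - a)^+. \tag{5.15} \]

On the other hand, since $\tilde{X}_{n,M}$ is a SMG, $EZ_M \ge 0$. Taking expectations of
(5.15) gives

\[ E\beta_M \le \frac{E(X_M - a)^+}{b - a}. \tag{5.16} \]

Now $\beta_M$ is nondecreasing, and $\beta_M \uparrow \infty$ on $S$. This contradicts (5.16) since
$\limsup E(X_M - a)^+ \le \limsup E|X_M| + |a|$, unless $P(S) = 0$. This establishes

\[ P\Big(\bigcup \{\liminf X_n < a < b < \limsup X_n\}\Big) = 0, \]

where the union is taken over all rational $a$, $b$. Then either a random
variable $X$ exists such that $X_n \to X$ a.s. or $|X_n| \to \infty$ with positive probability.
This latter case is eliminated by Fatou's lemma, that is,

\[ \int \liminf |X_n| \, dP \le \liminf \int |X_n| \, dP. \]

From this we also conclude that

\[ \int |X| \, dP \le \liminf \int |X_n| \, dP < \infty. \quad \text{Q.E.D.} \]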


In the body of the above proof, a result is obtained which will be useful in 
the future. 




Lemma 5.17. Let $X_1, \ldots, X_M$ be random variables such that $E|X_n| < \infty$,
$n = 1, \ldots, M$ and $E(X_{n+1} \mid X_n, \ldots, X_1) \ge X_n$, $n = 1, \ldots, M - 1$. Let
$\beta_M$ be the number of times that the sequence $X_1, \ldots, X_M$ crosses a finite
interval $[a, b]$ from left to right. Then,

\[ E\beta_M \le \frac{E(X_M - a)^+}{b - a}. \]

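Problem 3. Now use the martingale convergence theorem to prove that for
$X_1, X_2, \ldots$ independent random variables, $EX_k = 0$, $k = 1, 2, \ldots$,

\[ \sum_{1}^{\infty} EX_k^2 < \infty \;\Rightarrow\; \sum_{1}^{\infty} X_k \ \text{converges a.s.} \]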


5. FURTHER MARTINGALE THEOREMS 

To go further and apply the basic convergence theorem, we need: 
Definition 5.18. Let $X_1, X_2, \ldots$ be a process. It is said to be uniformly integrable
if $E|X_k| < \infty$, all $k$, and

\[ \lim_{x \uparrow \infty} \limsup_n \int_{\{|X_n| > x\}} |X_n| \, dP = 0. \]


Proposition 5.19. Let $X_1, X_2, \ldots$ be uniformly integrable, then $\limsup E|X_n| < \infty$.
If $X_n \to X$ a.s., then $E|X - X_n| \to 0$, and $EX_n \to EX$.


Proof. First of all, for any $x > 0$,

\[ E|X_n| \le x + \int_{\{|X_n| > x\}} |X_n| \, dP. \]

Hence,

\[ \limsup E|X_n| \le x + \limsup \int_{\{|X_n| > x\}} |X_n| \, dP. \]

But the last term must be finite for some value of $x$ sufficiently large. For
the next item, use Fatou's lemma:

\[ \int |X| \, dP = \int \liminf |X_n| \, dP \le \liminf \int |X_n| \, dP. \]

Now, by Egoroff's theorem (Halmos [64, p. 88]), for any $\varepsilon > 0$, we can take
$A \in \mathcal{F}$ such that $P(A) < \varepsilon$ and $V_n = |X_n - X| \to 0$ uniformly on $A^c$,

\[ \limsup EV_n = \limsup \int_A |X_n - X| \, dP \le \limsup \int_A |X_n| \, dP + \int_A |X| \, dP. \]

But bounds can be gotten by writing

\[ \int_A |X_n| \, dP = \int_{A \cap \{|X_n| \le x\}} |X_n| \, dP + \int_{A \cap \{|X_n| > x\}} |X_n| \, dP \le x\varepsilon + \int_{\{|X_n| > x\}} |X_n| \, dP. \tag{5.20} \]

Taking $\varepsilon \downarrow 0$, we get

\[ \limsup EV_n \le \limsup_n \int_{\{|X_n| > x\}} |X_n| \, dP + \int_{\{|X| > x\}} |X| \, dP, \]

since (5.20) holds for $|X|$ as well as $|X_n|$. Now letting $x \uparrow \infty$ gives the result.
For the last result, use $|EX - EX_n| \le E|X - X_n|$.


We use this concept of uniform integrability in the proof of: 


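Theorem 5.21. Let $Z, Y_1, Y_2, \ldots$ be random variables on $(\Omega, \mathcal{F}, P)$, such that
$E|Z| < \infty$. Then

\[ E(Z \mid Y_1, Y_2, \ldots, Y_n) \xrightarrow[1]{\text{a.s.}} E(Z \mid Y_1, Y_2, \ldots). \]

(Here $\xrightarrow[1]{\text{a.s.}}$ indicates both a.s. and first mean convergence.)


Proof. Let $X_n = E(Z \mid Y_1, \ldots, Y_n)$. Then $E|X_n| \le E|Z|$. By 4.20(3)

\[ E(X_{n+1} \mid Y_1, \ldots, Y_n) = X_n \ \text{a.s.} \]

Since

\[ \mathcal{F}(X_1, \ldots, X_n) \subset \mathcal{F}(Y_1, \ldots, Y_n), \]

the sequence $X_1, X_2, \ldots$ is a MG. Since $\limsup E|X_n| < \infty$,

\[ X_n \to X \ \text{a.s.}, \quad \text{and} \quad E|X| \le \liminf E|X_n| \le E|Z|. \]

By convexity, $|X_n| \le E(|Z| \mid Y_1, \ldots, Y_n)$. Hence

\[ \int_{\{|X_n| > x\}} |X_n| \, dP \le \int_{\{|X_n| > x\}} E(|Z| \mid Y_1, \ldots, Y_n) \, dP = \int_{\{|X_n| > x\}} |Z| \, dP \le \int_{\{\sup_n |X_n| > x\}} |Z| \, dP. \]

Now,

\[ U = \sup_n |X_n| \]

is a.s. finite; hence as $x \uparrow \infty$, the sets $\{U > x\}$ converge down to a set of
probability zero. Thus the $X_n$ sequence is uniformly integrable and $X_n \xrightarrow[1]{\text{a.s.}} X$.
Let $A \in \mathcal{F}(Y_1, \ldots, Y_N)$; then

\[ \lim_n \int_A X_n \, dP = \int_A X \, dP. \]

But for $n \ge N$,

\[ \int_A X_n \, dP = \int_A Z \, dP. \]

Hence $\int_A X \, dP = \int_A Z \, dP$ for every $A$ in $\bigcup_N \mathcal{F}(Y_1, \ldots, Y_N)$, and therefore for
every $A \in \mathcal{F}(Y_1, Y_2, \ldots)$. Since $X$ is measurable with respect to $\mathcal{F}(Y_1, Y_2, \ldots)$,
this implies that $X = E(Z \mid Y_1, Y_2, \ldots)$ a.s.

Corollary 5.22. If $\mathcal{F}(Z) \subset \mathcal{F}(Y_1, Y_2, \ldots)$, and $E|Z| < \infty$, then

\[ E(Z \mid Y_1, \ldots, Y_n) \xrightarrow[1]{\text{a.s.}} Z. \]

In particular, if $A \in \mathcal{F}(Y_1, \ldots)$, then

\[ P(A \mid Y_1, \ldots, Y_n) \xrightarrow[1]{\text{a.s.}} \chi_A(\omega). \]

Proof. Immediate.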

There is a converse of sorts, most useful, to 5.22. 
Theorem 5.23. Let $X_1, X_2, \ldots$ be a SMG (MG), and uniformly integrable.
Then $X_n \xrightarrow[1]{\text{a.s.}} X$, and

\[ X_n \underset{(=)}{\le} E(X \mid X_1, \ldots, X_n) \]

(with equality holding if $X_1, X_2, \ldots$ is MG).

Proof. By definition, for every $m > n$,

\[ E(X_m \mid X_1, \ldots, X_n) \underset{(=)}{\ge} X_n \ \text{a.s.} \]

Thus, for every set $A \in \mathcal{F}(X_1, \ldots, X_n)$, and $m > n$,

\[ \int_A X_n \, dP \underset{(=)}{\le} \int_A X_m \, dP. \]

But by (5.19) $\limsup E|X_n| < \infty$, hence $X_n \to X$ a.s., implying $X_n \to X$ in first mean, so

\[ \int_A X_n \, dP \underset{(=)}{\le} \int_A X \, dP \;\Rightarrow\; X_n \underset{(=)}{\le} E(X \mid X_1, \ldots, X_n) \ \text{a.s.} \]

Therefore, every uniformly integrable MG sequence is a sequence of conditional
expectations and every uniformly integrable SMG sequence is bounded
above by a MG sequence.

In 5.21 we added conditions, that is, we took the conditional expectation
of $Z$ relative to an increasing sequence of $\sigma$-fields $\mathcal{F}(Y_1, \ldots, Y_n)$. We can also go
the other way.

Theorem 5.24. Let $Z, Y_1, Y_2, \ldots$ be random variables, $E|Z| < \infty$, $\mathcal{T}$ the tail
$\sigma$-field on the $Y_1, Y_2, \ldots$ process. Then

\[ E(Z \mid Y_n, Y_{n+1}, \ldots) \xrightarrow[1]{\text{a.s.}} E(Z \mid \mathcal{T}). \]

Proof. A bit of a sly dodge is used here. Define $X_{-n} = E(Z \mid Y_n, Y_{n+1}, \ldots)$,
$n = 1, 2, \ldots$ Note that

\[ E(X_{-n} \mid Y_{n+1}, \ldots) = X_{-n-1}, \]

so Lemma 5.17 is applicable to the sequence $X_{-M}, X_{-M+1}, \ldots, X_{-1}$. For any
interval $[a, b]$, if $\beta_M$ is the number of times that this sequence crosses from
below $a$ to above $b$, then

\[ E\beta_M \le \frac{E(X_{-1} - a)^+}{b - a}. \]

By the same argument as in the main convergence theorem, $X_{-n}$ either must
converge to a random variable $X$ a.s. or $|X_{-n}| \to \infty$ with positive probability.
But $E|X_{-n}| \le E|Z|$, so that Fatou's lemma again gives $X_{-n} \to X$ a.s., and
$E|X| \le E|Z|$. Just as in the proof of 5.21,

\[ \int_{\{|X_{-n}| > x\}} |X_{-n}| \, dP \le \int_{\{|X_{-n}| > x\}} |Z| \, dP \le \int_{\{\sup_{n \ge 1} |X_{-n}| > x\}} |Z| \, dP, \]

so the $X_{-n}$ sequence is uniformly integrable; hence $E(Z \mid Y_n, Y_{n+1}, \ldots) \xrightarrow[1]{\text{a.s.}} X$.
Since $X$ is an a.s. limit of random variables measurable with respect to
$\mathcal{F}(Y_n, \ldots)$, then $X$ is measurable with respect to $\mathcal{F}(Y_n, \ldots)$ for every $n$.
Hence $X$ is a random variable on $(\Omega, \mathcal{T})$. Let $A \in \mathcal{T}$; then $A \in \mathcal{F}(Y_n, \ldots)$,
and

\[ \int_A X \, dP = \lim_n \int_A E(Z \mid Y_n, \ldots) \, dP = \int_A Z \, dP, \]

which proves that $X = E(Z \mid \mathcal{T})$ a.s.


Problems 


4. Show that $E|X_n| < \infty$, $E|X| < \infty$, $E|X_n - X| \to 0$ $\Rightarrow$ $X_1, X_2, \ldots$ is
a uniformly integrable sequence. Get the same conclusion if $X_n \to X$ a.s.
and $E|X_n| \to E|X|$.

5. Apply Theorem 5.21 to the analytic model for coin-tossing and the coin-tossing
variables defined in Chapter 1, Section 5. Take $f(x)$ any Borel
measurable function on $[0, 1)$ such that $\int |f(x)| \, dx < \infty$. Let $I_1, I_2, \ldots, I_N$
be the intervals

\[ \Big[0, \frac{1}{2^n}\Big), \Big[\frac{1}{2^n}, \frac{2}{2^n}\Big), \ldots, \]

and define the step function $f_n(x)$ by

\[ f_n(x) = \frac{1}{l(I_k)} \int_{I_k} f(y) \, dy, \quad x \in I_k, \quad k = 1, \ldots, N. \]

Then prove that

\[ f_n(x) \to f(x) \ \text{a.s.} \]

6. Given a process $X_1, X_2, \ldots$, abbreviate $\mathcal{F}_n = \mathcal{F}(X_n, X_{n+1}, \ldots)$. Use
5.24 to show that the tail $\sigma$-field $\mathcal{T}$ on the process has the zero-one property
($C \in \mathcal{T} \Rightarrow P(C) = 0$ or $1$) iff for every $A \in \mathcal{F}(X)$,

\[ \lim_n \sup_{B \in \mathcal{F}_n} |P(A \cap B) - P(A)P(B)| = 0. \]

7. Use Problem 15, Chapter 4, to prove the strong law of large numbers 
(3.30) from the martingale results. 


6. STOPPING TIMES 


Definition 5.25. Given a process $X_1, X_2, \ldots$, an extended stopping time is an
extended integer-valued random variable $n^*(\omega) \ge 1$ such that $\{n^* = j\} \in
\mathcal{F}(X_1, \ldots, X_j)$. The process $\tilde{X}_1, \tilde{X}_2, \ldots$ derived under stopping is defined by

$$\tilde{X}_n = \begin{cases} X_n, & \text{if } n < n^*(\omega), \\ X_{n^*}, & \text{if } n \ge n^*(\omega). \end{cases}$$

If we define variables $m_n(\omega)$ as $\min(n, n^*(\omega))$, then obviously the $m_n(\omega)$
are optional sampling variables, and the $\tilde{X}_n$ defined in 5.25 above are given
by $\tilde{X}_n = X_{m_n}$. Hence stopping is a special case of optional sampling.
Furthermore,


Proposition 5.26. Let $X_1, X_2, \ldots$ be a SMG(MG), and $\tilde{X}_1, \tilde{X}_2, \ldots$ be derived
under stopping from $X_1, X_2, \ldots$ Then $\tilde{X}_1, \tilde{X}_2, \ldots$ is a SMG(MG).


Proof. All that is necessary is to show that conditions (a) and (b) of the
optional sampling Theorem 5.10 are met. Now

$$E|\tilde{X}_n| = \int_{\{n \le n^*\}} |X_n| \, dP + \int_{\{n > n^*\}} |X_{n^*}| \, dP
\le E|X_n| + \sum_{j=1}^{n-1} \int_{\{n^* = j\}} |X_j| \, dP \le \sum_{j=1}^{n} E|X_j|.$$

Noting that $m_n(\omega) = \min(n, n^*(\omega)) \le n$, we know that $\{m_n > N\} = \emptyset$
for $N \ge n$, and

$$\int_{\{m_n > N\}} |X_N| \, dP = 0.$$


Not only does stopping appear as a transformation worthy of study,
but it is also a useful tool in proving some strong theorems. For any set
$B \in \mathcal{B}_1$, and a process $X_1, X_2, \ldots$, define an extended stopping time by

$$n^*(\omega) = \begin{cases} \text{first } n \text{ such that } X_n \in B, \\ \infty \text{ if no such } n \text{ exists.} \end{cases}$$

Then $\tilde{X}_1, \tilde{X}_2, \ldots$ is called the process stopped on $B$.
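The stopping construction is easy to act out on a finite stretch of a sample path. The following is only a sketch: the names `first_hit` and `stopped_path` are ours, the set $B$ is passed as a membership test, and a path that never enters $B$ within the observed stretch is treated as if $n^* = \infty$.

```python
from typing import Callable, List, Optional, Sequence


def first_hit(path: Sequence[float], in_B: Callable[[float], bool]) -> Optional[int]:
    """Return n* (1-indexed), the first n with X_n in B, or None if the
    observed stretch never enters B (this stands in for n* = infinity)."""
    for n, x in enumerate(path, start=1):
        if in_B(x):
            return n
    return None


def stopped_path(path: Sequence[float], in_B: Callable[[float], bool]) -> List[float]:
    """Process stopped on B: the n-th value is X_{min(n, n*)}."""
    n_star = first_hit(path, in_B)
    if n_star is None:
        return list(path)
    return [path[min(n, n_star) - 1] for n in range(1, len(path) + 1)]


# Stop on B = [2, infinity): the path freezes at the first value >= 2.
path = [0, 1, -1, 2, 3, 1, 5]
print(stopped_path(path, lambda x: x >= 2))  # [0, 1, -1, 2, 2, 2, 2]
```

The index $\min(n, n^*)$ in `stopped_path` is exactly the optional sampling variable $m_n$ used above.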
Proposition 5.27. Let $X_1, X_2, \ldots$ be a SMG, $B$ the set $[a, \infty]$, $a > 0$. If
$E[\sup_n (X_{n+1} - X_n)] < \infty$, then for $\tilde{X}_n$ the process stopped on $B$,

$$\overline{\lim}\, E|\tilde{X}_n| < \infty.$$

Proof. For any $n$, $\tilde{X}_n^+ \le a + U$, where

$$U = \sup_n (X_{n+1} - X_n).$$

But $\tilde{X}_1 = X_1$. By the fact that $\tilde{X}_n$ is a SMG, $E\tilde{X}_n \ge E\tilde{X}_1$, so $E\tilde{X}_n^- \le
E\tilde{X}_n^+ - EX_1$. Thus

$$E|\tilde{X}_n| \le 2E\tilde{X}_n^+ - EX_1 \le 2a + 2EU - EX_1.$$


Theorem 5.28. Let $X_1, X_2, \ldots$ be a MG such that $E(\sup_n |X_{n+1} - X_n|) < \infty$.
If the sets $A_1, A_2$ are defined by

$$A_1 = \{\omega;\ \lim X_n(\omega) \text{ exists}\},$$

$$A_2 = \{\omega;\ \overline{\lim}\, X_n(\omega) = +\infty,\ \underline{\lim}\, X_n(\omega) = -\infty\},$$

then $A_1 \cup A_2 = \Omega$ a.s.


Proof. Consider the process $\tilde{X}_1, \tilde{X}_2, \ldots$ stopped on $[K, \infty]$. By 5.27 and
the basic convergence theorem $\tilde{X}_n \xrightarrow{\text{a.s.}} \tilde{X}$. On the set $F_K = \{\sup_n X_n < K\}$,
$\tilde{X}_n = X_n$, all $n$. Hence on $F_K$, $\lim_n X_n$ exists and is finite a.s. Thus this limit
exists and is finite a.s. on the set $\bigcup_{K=1}^{\infty} F_K$, but this set is exactly the set
$\{\overline{\lim}\, X_n < \infty\}$. By using now the MG sequence $-X_1, -X_2, \ldots$, conclude
that $\lim_n X_n$ exists and is finite a.s. on the set $\{\underline{\lim}\, X_n > -\infty\}$. Hence $\lim X_n$
exists and is finite for almost all $\omega$ in the set $\{\overline{\lim}\, X_n < \infty\} \cup \{\underline{\lim}\, X_n > -\infty\}$,
and the theorem is proved.


This theorem is something like a zero-one law. Forgetting about a set
of probability zero, according to this theorem we find that for every $\omega$,
either $\lim X_n(\omega)$ exists finite, or the sequence $X_n(\omega)$ behaves badly in the sense
that $\underline{\lim}\, X_n(\omega) = -\infty$, $\overline{\lim}\, X_n(\omega) = +\infty$. There are some interesting and
useful applications. An elegant extension of the Borel-Cantelli lemma valid
for arbitrary processes comes first.


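Corollary 5.29 (extended Borel-Cantelli lemma). Let $Y_1, Y_2, \ldots$ be any
process, and $A_n \in \mathcal{F}(Y_1, \ldots, Y_n)$. Then almost surely

$$\{\omega;\ \omega \in A_n \text{ i.o.}\} = \Big\{\omega;\ \sum_1^{\infty} P(A_{n+1} \mid Y_n, \ldots, Y_1) = \infty\Big\}.$$

Proof. Let

$$X_{n+1} = \sum_{k=1}^{n} \big[\chi_{A_{k+1}} - P(A_{k+1} \mid Y_k, \ldots, Y_1)\big], \quad n \ge 1.$$

Note that

$$E(X_{n+1} \mid Y_n, \ldots, Y_1) = E\big(\chi_{A_{n+1}} - P(A_{n+1} \mid Y_n, \ldots, Y_1) \mid Y_n, \ldots, Y_1\big) + X_n = X_n.$$

Obviously, also $|X_{n+1}| \le n$, so $\{X_n\}$ is a MG sequence. To boot,

$$|X_{n+1} - X_n| \le 1,$$

so 5.28 is applicable. Now

$$\{\omega;\ \omega \in A_n \text{ i.o.}\} = \Big\{\omega;\ \sum_1^{\infty} \chi_{A_n}(\omega) = \infty\Big\}.$$

Let $D_1 = \{\lim X_n \text{ exists finite}\}$. Then on $D_1$,

$$\sum_1^{\infty} \chi_{A_{n+1}} = \infty \iff \sum_1^{\infty} P(A_{n+1} \mid Y_n, \ldots, Y_1) = \infty.$$

Let $D_2 = \{\underline{\lim}\, X_n = -\infty,\ \overline{\lim}\, X_n = +\infty\}$; then for all $\omega \in D_2$,

$$\sum_1^{\infty} \chi_{A_{n+1}} = \infty \quad \text{and} \quad \sum_1^{\infty} P(A_{n+1} \mid Y_n, \ldots, Y_1) = \infty.$$

Since $P(D_1 \cup D_2) = 1$, the corollary follows.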


Some other applications of 5.28 are in the following problems. 


Problems 


8. A loose end, which is left over from the random signs problem, is that if
$Y_1, Y_2, \ldots$ are independent random variables, $EY_k = 0$, by the zero-one law
either $X_n = \sum_1^n Y_k$ converges to a finite limit a.s. or diverges a.s. The nature of
the divergence can be gotten from 5.28 in an important special case. Show
that if $Y_1, \ldots$ are independent, $|Y_k| \le \alpha < \infty$, all $k$, $EY_k = 0$, $S_n =
Y_1 + \cdots + Y_n$, then either

a) $P(\lim S_n \text{ exists}) = 1$, or

b) $P(\overline{\lim}\, S_n = \infty,\ \underline{\lim}\, S_n = -\infty) = 1$.


9. For any process $X_1, X_2, \ldots$ and sets $A, B \in \mathcal{B}_1$, suppose that $P(X_m \in B$
for at least one $m > n \mid X_n, \ldots, X_1) \ge \delta > 0$, on $\{X_n \in A\}$. Then prove
that

$$\{X_n \in A \text{ i.o.}\} \subset \{X_n \in B \text{ i.o.}\} \quad \text{a.s.}$$

[Let $F_N = \bigcup_{n \ge N} \{X_n \in B\}$, and use $P(F_N \mid X_n, \ldots, X_1) \xrightarrow{\text{a.s.}} \chi_{F_N}$.]

10. Consider a process $X_1, X_2, \ldots$ taking values in $[0, \infty)$. Consider $\{0\}$
an absorbing state in the sense that $X_n = 0 \Rightarrow X_{n+m} = 0$, $m \ge 1$. Let $D$
be the event that the process is eventually absorbed at zero, that is,

$$D = \{\exists\, n \text{ such that } X_n = 0\}.$$

If, for every $x$ there exists a $\delta > 0$ such that

$$P(D \mid X_n, \ldots, X_1) \ge \delta, \quad \text{if } X_n \le x, \quad n = 1, 2, \ldots,$$

prove that for almost every sample sequence, either $X_1, X_2, \ldots$ is eventually
absorbed, or $X_n \to \infty$.


7. STOPPING RULES 


Definition 5.30. If an extended stopping time $n^*$ is a.s. finite, call it a stopping
time or a stopping rule. The stopped variable is $X_{n^*}$.


It is clear that if we define $m_1(\omega) = 1$, $m_n(\omega) = n^*(\omega)$, $n \ge 2$, then the
$m_n$ are optional sampling variables. From the optional sampling theorem we
get the interesting


Corollary 5.31. Let $n^*$ be a stopping rule. If $X_1, X_2, \ldots$ is a SMG(MG)
and if

a) $E|X_{n^*}| < \infty$,

b) $\underline{\lim}_N \int_{\{n^* > N\}} |X_N| \, dP = 0$,

then

$$(5.32) \qquad EX_{n^*} \underset{(=)}{\ge} EX_1.$$


Proof. Obvious. 


The interest of 5.31 in gambling is as follows: If a sequence of games is
unfavorable under a given gambling system, then the variables $-S_n$ form
a SMG if $E|S_n| < \infty$. Suppose we use some stopping rule $n^*$ which, based on
the outcome of the first $j$ games, tells us whether to quit or not after the $j$th
game. Then our terminal fortune is $S_{n^*}$. But if (a) and (b) are in force, then

$$ES_{n^*} \le ES_1$$

with equality if the game is fair. Thus we cannot increase our expected
fortune by using a stopping rule. Also, in the context of gambling, some
illumination can be shed on the conditions (a) and (b) of the optional
sampling theorem. Condition (b), which is pretty much the stickler, says
roughly that the variables $m_n$ cannot sample too far out in the sequence too
fast. For example, if there are constants $\alpha_n$ such that $m_n \le \alpha_n$, all $n$ (even
if $\alpha_n \to \infty$), then (b) is automatically satisfied. A counterexample where (b)
is violated is in the honored rule “stop when you are ahead.” Let $Y_1, Y_2, \ldots$
be independent random variables, $Y_i = \pm 1$ with probability $\frac{1}{2}$ each. Then
$X_n = Y_1 + \cdots + Y_n$ is a MG sequence and represents the winnings after $n$
plays in a coin-tossing game. From Problem 8, $\overline{\lim}\, X_n = +\infty$ a.s., hence
we can define a stopping rule by

$$n^* = \{\text{first } n \text{ such that } X_n = 1\},$$

that is, stop as soon as we win one dollar. If (5.32) were in force, then
$EX_{n^*} = EX_1 = 0$, but $X_{n^*} \equiv 1$. Now (a) is satisfied because $E|X_{n^*}| = 1$,
hence we must conclude that (b) is violated. This we can show directly.
Note that $|X_n|$ is a SMG sequence, and

$$\{X_1 = -1, X_2 \ne 0, \ldots, X_{N-1} \ne 0\} \subset \{n^* > N\}.$$

Therefore

$$\int_{\{n^* > N\}} |X_N| \, dP \ge \int_{\{X_1 = -1, X_2 \ne 0, \ldots, X_{N-1} \ne 0\}} |X_N| \, dP
\ge \int_{\{X_1 = -1, \ldots, X_{N-1} \ne 0\}} |X_{N-1}| \, dP
= \int_{\{X_1 = -1, \ldots, X_{N-2} \ne 0\}} |X_{N-1}| \, dP.$$

Going down the ladder we find that

$$\int_{\{n^* > N\}} |X_N| \, dP \ge \int_{\{X_1 = -1\}} |X_1| \, dP = \tfrac{1}{2}.$$
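The failure of condition (b) for the rule "stop as soon as we win one dollar" can also be checked by direct enumeration. The sketch below (the function name `tail_integral` is ours) computes the integral of $|X_N|$ over $\{n^* > N\}$ exactly for the fair $\pm 1$ game by propagating the distribution of the walk over paths that have not yet reached $+1$, using rational arithmetic.

```python
from fractions import Fraction


def tail_integral(N: int) -> Fraction:
    """Exact value of E[|X_N|; n* > N] for the fair +-1 walk,
    where n* is the first n with X_n = 1."""
    half = Fraction(1, 2)
    # alive[x] = P(X_n = x and n* > n); before the first play the walk sits at 0.
    alive = {0: Fraction(1)}
    for _ in range(N):
        nxt = {}
        for x, p in alive.items():
            for step in (-1, 1):
                y = x + step
                if y == 1:
                    continue  # the rule stops here, so this mass leaves {n* > n}
                nxt[y] = nxt.get(y, Fraction(0)) + half * p
        alive = nxt
    return sum((abs(x) * p for x, p in alive.items()), Fraction(0))


for N in (1, 2, 3):
    print(N, tail_integral(N))  # 1 1/2, then 2 1/2, then 3 5/8
```

For every $N$ tried the value stays at or above $\frac{1}{2}$, in line with the bound just derived, so the lower limit in (b) cannot be zero.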


Here, as a matter of fact, $n^*$ can be quite large. For example, note that
$En^* = \infty$, because if $Y_1 = -1$, we have to wait for an equalization, in
other words, wait for the first time that $Y_1 + \cdots + Y_n = 0$ before we can
possibly get a dollar ahead. Actually, on the converse side, it is not difficult
to prove:


Proposition 5.33. Let $X_1, X_2, \ldots$ be a SMG(MG) and $n^*$ a stopping rule. If
$En^* < \infty$ and $E(|X_{n+1} - X_n| \mid X_n, \ldots, X_1) \le \alpha < \infty$, $n < n^*$, then

$$EX_{n^*} \underset{(=)}{\ge} EX_1.$$

Proof. All we need to do is verify the conditions of 5.31. Denote $Z_n =
|X_n - X_{n-1}|$, $n > 1$, $Z_1 = |X_1|$, $Y = Z_1 + \cdots + Z_{n^*}$. Hence $|X_{n^*}| \le Y$.

$$EY = \sum_{k=1}^{\infty} \int_{\{n^* = k\}} Y \, dP = \sum_{k=1}^{\infty} \sum_{j=1}^{k} \int_{\{n^* = k\}} Z_j \, dP.$$

Interchange the order of summation so that

$$EY = \sum_{j=1}^{\infty} \int_{\{n^* \ge j\}} Z_j \, dP.$$

The set $\{n^* \ge j\} = \{n^* < j\}^c$ is in $\mathcal{F}(X_1, \ldots, X_{j-1})$. Therefore

$$\int_{\{n^* \ge j\}} Z_j \, dP = \int_{\{n^* \ge j\}} E(Z_j \mid X_{j-1}, \ldots, X_1) \, dP \le \alpha P(n^* \ge j),$$

and we get

$$EY \le \alpha \sum_1^{\infty} P(n^* \ge j) = \alpha En^*.$$

For 5.31(b), since $Z_1 + \cdots + Z_N \le Y$ on the set $\{n^* > N\}$,

$$\int_{\{n^* > N\}} |X_N| \, dP \le \int_{\{n^* > N\}} Y \, dP.$$

As $N \to \infty$, $\{n^* > N\} \downarrow \emptyset$ a.s. Apply the bounded convergence theorem to
$\chi_{\{n^* > N\}} Y$ to get the result.


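Proposition 5.33 has interesting applications to sums of independent ran-
dom variables. Let $Y_1, Y_2, \ldots$ be independent and identically distributed,
$S_n = Y_1 + \cdots + Y_n$, and assume that for some real $\lambda \ne 0$, $\varphi(\lambda) = Ee^{\lambda Y_1}$
exists, and that $\varphi(\lambda) \ge 1$. Then


Proposition 5.34 (Wald’s identity). If $n^*$ is a stopping time for the sums
$S_1, S_2, \ldots$ such that $|S_n| \le \gamma$, $n < n^*$, and $En^* < \infty$, then

$$E\left[\frac{e^{\lambda S_{n^*}}}{\varphi(\lambda)^{n^*}}\right] = 1.$$

Proof. The random variables

$$X_n = \frac{e^{\lambda S_n}}{\varphi(\lambda)^n}$$

form a MG, since

$$E\left(\frac{e^{\lambda S_{n+1}}}{\varphi(\lambda)^{n+1}} \,\Big|\, S_n, \ldots, S_1\right) = \frac{e^{\lambda S_n}}{\varphi(\lambda)^{n+1}} E\big(e^{\lambda Y_{n+1}}\big) = \frac{e^{\lambda S_n}}{\varphi(\lambda)^n}.$$

Obviously, $EX_1 = 1$, so if the second condition of 5.33 holds, then Wald’s
identity follows. The condition is

$$E\left(\left|\frac{e^{\lambda S_{n+1}}}{\varphi(\lambda)^{n+1}} - \frac{e^{\lambda S_n}}{\varphi(\lambda)^n}\right| \,\Big|\, S_n, \ldots, S_1\right) \le \alpha, \quad n < n^*,$$

or

$$e^{\lambda S_n} E\left|\frac{e^{\lambda Y_1}}{\varphi(\lambda)} - 1\right| \le \alpha \varphi(\lambda)^n, \quad n < n^*,$$

which is clearly satisfied under 5.34.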
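Wald's identity can be checked numerically in a simple case. The sketch below takes $Y_i = \pm 1$ with probability $\frac{1}{2}$ each, so that $\varphi(\lambda) = \cosh \lambda \ge 1$, and $n^*$ the first time the sum reaches $r$ or $-s$; this choice of game, the function name `wald_lhs`, and the truncation after a fixed number of plays are our assumptions for illustration. The discounted mass of the walk is pushed forward one play at a time and collected at the two barriers.

```python
import math


def wald_lhs(lam: float, r: int, s: int, steps: int = 5000) -> float:
    """Approximate E[exp(lam * S_{n*}) / phi(lam)**n*] for the fair +-1 walk
    stopped on first reaching r or -s, truncating after `steps` plays."""
    phi = math.cosh(lam)  # E exp(lam * Y_1) when Y_1 = +-1 with probability 1/2
    # w[x] = E[phi**(-n); S_n = x, n < n*] over the interior states -s < x < r
    w = {0: 1.0}
    total = 0.0
    for _ in range(steps):
        nxt = {}
        for x, m in w.items():
            for step in (-1, 1):
                y = x + step
                mass = 0.5 * m / phi
                if y == r or y == -s:
                    total += math.exp(lam * y) * mass  # the walk stops at a barrier
                else:
                    nxt[y] = nxt.get(y, 0.0) + mass
        w = nxt
    return total


print(abs(wald_lhs(0.3, r=3, s=2) - 1.0) < 1e-9)  # True
```

The mass still inside the strip decays geometrically, so the truncation error after a few thousand plays is far below floating-point resolution.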
Problems


11. For $n^*$ a stopping time for sums $S_1, S_2, \ldots$ of independent, identically
distributed random variables $Y_1, Y_2, \ldots$, $E|Y_1| < \infty$, prove that $En^* < \infty$
implies that

$$ES_{n^*} = EY_1 \cdot En^*.$$

Use this to give another derivation of Blackwell’s equation (Problem 21,
Chapter 3).


12. Let $Y_1, Y_2, \ldots$ be independent and equal $\pm 1$ with probability $\frac{1}{2}$. Let
$r, s$ be positive integers. Define

$$n^* = \{\text{first } n \text{ such that } S_n = r \text{ or } -s\}.$$

Show

a) $En^* < \infty$,

b) $P(S_{n^*} = r) = \dfrac{s}{r + s}$.

For $r = s$, evaluate $Ee^{-\lambda n^*}$, $\lambda > 0$.


13. (See Doob, [39, p. 308].) For sums of independent, identically dis-
tributed random variables $Y_1, Y_2, \ldots$, define $n^*$ as the time until the first
positive sum, that is,

$$n^* = \min \{n;\ S_n > 0\}.$$

Prove that if $EY_1 = 0$, then $En^* = \infty$.


8. BACK TO GAMBLING 


The reason that the strategy “quit when you are ahead” works in the
fair coin-tossing game is that an infinite initial fortune is assumed. That is,
there is no lower limit, say $-M$, such that if $S_n$ becomes less than $-M$ play
is ended.

A more realistic model of gambling would consider the sequence of
fortunes $S_0, S_1, S_2, \ldots$ (that is, money in hand) as being nonnegative and
finite. We now turn to an analysis of a sequence of fortunes under gambling
satisfying

(5.35) i) $S_n \ge 0$, $n = 0, 1, 2, \ldots$, $S_0$ constant,
       ii) $ES_n < \infty$ and $E(S_{n+1} \mid S_n, \ldots, S_0) \le S_n$ a.s., $n \ge 0$.


In addition, in any reasonable gambling house, and by the structure of our
monetary system, if we bet on the $n$th trial, there must be a lower bound to
the amount we can win or lose. We formalize this by


Assumption 5.36. There exists a $\delta > 0$ such that either

$$S_{n+1} = S_n \quad \text{or} \quad |S_{n+1} - S_n| \ge \delta.$$

Definition 5.37. We will say that we bet on the $n$th game if $|S_n - S_{n-1}| \ge \delta$.
Let $n^*$ be the (possibly extended) time of the last bet, that is,

$$n^* = \begin{cases} \text{largest } n \text{ such that } |S_n - S_{n-1}| \ge \delta, \\ +\infty \text{ if no such } n \text{ exists,} \end{cases}$$

where $S_0$ is the starting fortune.


Under (5.35), (i) and (ii), and 5.36, the martingale convergence theorem
yields strong results. You can’t win!


Theorem 5.38.

$$P(n^* < \infty) = 1, \qquad ES_{n^*} \le S_0.$$


Remark. The interesting thing about $P(n^* < \infty) = 1$ is the implication that
in an unfavorable (or fair) sequence of games, one cannot keep betting
indefinitely. There must be a last bet. Furthermore, $ES_{n^*} \le S_0$ implies that
the expected fortune after the last bet is no larger than the initial fortune.


Proof. Let $X_n = -S_n$; then the $X_0, X_1, \ldots$ sequence is a SMG. Further-
more, $E|X_n| = E|S_n| = ES_n$. Thus $E|X_n| \le S_0$, all $n$. Hence there exists
a random variable $X$ such that $X_n \xrightarrow{\text{a.s.}} X$. Thus

$$P(|X_{n+1} - X_n| \ge \delta \text{ i.o.}) = 0,$$

or

$$P(|S_{n+1} - S_n| \ge \delta \text{ i.o.}) = 0 \;\Rightarrow\; P(n^* < \infty) = 1.$$

To prove the second part use the monotone convergence theorem

$$\int S_{n^*} \, dP = \lim_n \int_{\{n^* \le n\}} S_{n^*} \, dP.$$

But on $\{n^* \le n\}$, $S_{n^*} = S_n$; hence

$$ES_{n^*} = \lim_n \int_{\{n^* \le n\}} S_n \, dP \le \lim_n ES_n \le S_0.$$

Note that the theorem is actually a simple corollary of the martingale
convergence theorem. Now suppose that the gambling house has a minimum
bet of $\alpha$ dollars and we insist on betting as long as $S_n \ge \alpha$; then $n^*$ becomes
the time “of going broke,” that is, $n^* = \{\text{first } n \text{ such that } S_n < \alpha\}$, and the
obvious corollary of 5.38 is that the persistent gambler goes broke with
probability one.
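For a concrete feel of how the persistent gambler fares, consider the simplest fair game: a unit bet on a fair coin at every play, minimum bet $\alpha = 1$, play continuing as long as the fortune is at least 1. This specific game and the function name `survival_probability` are our illustrative assumptions, not part of the theorem. The sketch propagates the distribution of the fortune over the paths that have not yet gone broke.

```python
def survival_probability(start: int, plays: int) -> float:
    """P(the fortune has stayed >= 1 through `plays` fair unit bets),
    starting from `start` dollars."""
    alive = {start: 1.0}  # alive[x] = P(fortune = x and not yet broke)
    for _ in range(plays):
        nxt = {}
        for x, p in alive.items():
            for y in (x - 1, x + 1):
                if y >= 1:  # still able to cover the minimum bet of 1
                    nxt[y] = nxt.get(y, 0.0) + 0.5 * p
        alive = nxt
    return sum(alive.values())


for plays in (10, 100, 1000):
    print(plays, survival_probability(3, plays))
```

The printed probabilities shrink toward zero as the number of plays grows, though slowly, which is the numerical face of "goes broke with probability one."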


NOTES 


Martingales were first fully explored by Doob, in 1940 [32] and systematically 
developed in his book [39] of 1953. Their widespread use in probability 
theory has mostly occurred since that time. However, many of the results
had been scattered around for some time. In particular, some of them are due
to Lévy, appearing in his 1937 book [103], and some to Ville [137, 1939].
Some of the convergence theorems in a measure-theoretic framework are due 
to Andersen and Jessen. See the Appendix to Doob’s book [39] for a dis- 
cussion of the connection with the Andersen-Jessen approach, and complete 
references. The important concepts of optional sampling, optional stopping, 
and the key lemma 5.17 are due to Doob. 

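David Freedman has pointed out to me that many of the convergence
results can be gotten from the inequality 5.13. For example, here is a more
elementary and illuminating proof of 5.21 for $Z$ an $\mathcal{F}(\mathbf{X})$ measurable random
variable. For any $\epsilon > 0$, take $k$, $Z_k$ measurable $\mathcal{F}(X_1, \ldots, X_k)$ such that

$$E|Z - Z_k| \le \epsilon.$$

Now,

$$E(Z_k \mid X_1, \ldots, X_n) = Z_k, \quad n \ge k.$$

Let

$$Y_n = E(Z - Z_k \mid X_1, \ldots, X_n);$$

then the $\{Y_n\}$ is a MG, $\{|Y_n|\}$ is a SMG, and by 5.13,

$$P\Big(\sup_{1 \le n \le N} |Y_n| > x\Big) \le \frac{1}{x} E|Y_N| \le \frac{1}{x} E|Z - Z_k|.$$

Thus,

$$P\Big(\overline{\lim_n}\, |E(Z \mid X_1, \ldots, X_n) - Z_k| > x\Big) \le \frac{\epsilon}{x}.$$

Take $\epsilon \downarrow 0$ fast enough so that $Z_k \xrightarrow{\text{a.s.}} Z$ to get the result

$$E(Z \mid X_1, \ldots, X_n) \xrightarrow{\text{a.s.}} Z.$$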


For a fascinating modern approach to gambling strategies, see the 
book by Dubins and Savage [40]. 


CHAPTER 6 


STATIONARY PROCESSES 
AND THE ERGODIC THEOREM 


1. INTRODUCTION AND DEFINITIONS


The question here is: Given a process $X_1, X_2, \ldots$, find conditions for the
almost sure convergence of $(X_1 + \cdots + X_n)/n$. Certainly, if the $\{X_n\}$ are
independent identically distributed random variables and $E|X_1| < \infty$, then

$$\frac{X_1 + \cdots + X_n}{n} \xrightarrow{\text{a.s.}} EX_1.$$


A remarkable weakening of this result was proved by Birkhoff in 1931 [4]. 
Instead of having independent identically distributed random variables, 
think of requiring that the distribution of the process not depend on the 
placement of the time origin. In other words, assume that no matter when 
you start observing the sequence of random variables the resulting observa- 
tions will have the same probabilistic structure. 


Definition 6.1. A process $X_1, X_2, \ldots$ is called stationary if for every $k$, the
process $X_{k+1}, X_{k+2}, \ldots$ has the same distribution as $X_1, X_2, \ldots$, that is, for
every $B \in \mathcal{B}_\infty$,

$$(6.2) \qquad P((X_1, X_2, \ldots) \in B) = P((X_{k+1}, X_{k+2}, \ldots) \in B).$$

Since the distribution is determined by the distribution functions, (6.2) is
equivalent to: For every $x_1, \ldots, x_n$, and integer $k > 0$,

$$(6.3) \qquad P(X_1 < x_1, \ldots, X_n < x_n) = P(X_{k+1} < x_1, \ldots, X_{k+n} < x_n).$$

In particular, if a process is stationary, then all the one-dimensional dis-
tribution functions are the same, that is,

$$P(X_k < x) = P(X_1 < x), \quad k = 1, 2, \ldots$$

We can reduce (6.2) and (6.3) by noting


Proposition 6.4. A process $X_1, X_2, \ldots$ is stationary if the process $X_2, X_3, \ldots$
has the same distribution as $X_1, X_2, \ldots$


Proof. Let $X'_k = X_{k+1}$, $k = 1, 2, \ldots$ Then $X'_1, X'_2, \ldots$ has the same distribution
as $X_1, X_2, \ldots$ Hence $X'_2, X'_3, \ldots$ has the same distribution as $X_2, X_3, \ldots$,
and so forth.

Sometimes it is more convenient to look at stationary processes that
consist of a double-ended sequence of random variables $\ldots, X_{-1}, X_0, X_1, \ldots$
In this context, what we have is an infinite sequence of readings, beginning
in the infinitely remote past and continuing into the infinite future. Define
such a process to be stationary if its distribution does not depend on choice
of an origin, i.e., in terms of finite dimensional distributions:

$$P(X_1 < x_1, \ldots, X_n < x_n) = P(X_{k+1} < x_1, \ldots, X_{k+n} < x_n)$$

for all $x_1, \ldots, x_n$ and all $k$, both positive and negative.


The interesting point here is 


Proposition 6.5. Given any single-ended stationary process $X_1, X_2, \ldots$, there
is a double-ended stationary process $\ldots, \tilde{X}_{-1}, \tilde{X}_0, \tilde{X}_1, \ldots$ such that $\tilde{X}_1, \tilde{X}_2, \ldots$
and $X_1, X_2, \ldots$ have the same distribution.


Proof. From the Extension Theorem 2.26, all we need to define the $\tilde{X}_k$
process is a set of consistent distribution functions; i.e., we need to define

$$P(\tilde{X}_{-m} < x_{-m}, \ldots, \tilde{X}_0 < x_0, \ldots, \tilde{X}_n < x_n)$$

such that if either $x_{-m}$ or $x_n \uparrow \infty$, then we drop down to the next highest
distribution function. We do this by defining

$$P(\tilde{X}_{-m} < x_{-m}, \ldots, \tilde{X}_n < x_n) = P(X_1 < x_{-m}, \ldots, X_{n+m+1} < x_n),$$

that is, we slide the distribution functions of the $X_1, \ldots$ process to the left.
Now $X_1, X_2, \ldots$ can be looked at as the continuation of a process that has
already been going on an infinite length of time.

Starting with any stationary process, an infinity of stationary processes 
can be produced. 


Proposition 6.6. Let $X_1, X_2, \ldots$ be stationary, $\varphi(\mathbf{x})$ measurable $\mathcal{B}_\infty$; then the
process $Y_1, Y_2, \ldots$ defined by

$$Y_k = \varphi(X_k, X_{k+1}, \ldots)$$

is stationary.


Proof. On $R^{(\infty)}$ define $\varphi_k(\mathbf{x})$ as $\varphi(x_k, x_{k+1}, \ldots)$. The set

$$A = \{\mathbf{x};\ (\varphi_1(\mathbf{x}), \varphi_2(\mathbf{x}), \ldots) \in B\},$$

$B \in \mathcal{B}_\infty$, is in $\mathcal{B}_\infty$, because each $\varphi_k(\mathbf{x})$ is a random variable on $(R^{(\infty)}, \mathcal{B}_\infty)$.
Note

$$\{\omega;\ (Y_1, Y_2, \ldots) \in B\} = \{\omega;\ (X_1, X_2, \ldots) \in A\}$$

and

$$\{\omega;\ (Y_2, Y_3, \ldots) \in B\} = \{\omega;\ (X_2, X_3, \ldots) \in A\},$$

which implies the stationarity of the $Y_n$ sequence.

Corollary 6.7. Let $X_1, X_2, \ldots$ be independent and identically distributed
random variables, $\varphi(\mathbf{x})$ measurable $\mathcal{B}_\infty$; then

$$Y_k = \varphi(X_k, X_{k+1}, \ldots)$$

is stationary.


Proof. The $X_1, X_2, \ldots$ sequence is stationary.


Problem 1. Look at the unit circle and define

$$f(x) = \begin{cases} 1, & 0 \le x < \pi, \\ 0, & \pi \le x < 2\pi. \end{cases}$$

Here $\Omega$ is the unit circle, $\mathcal{F}$ the Borel $\sigma$-field. Take $P$ to be Lebesgue measure
divided by $2\pi$. Take $\theta$ to be an irrational angle. Define $x_1 = x$, $x_{k+1} =
(x_k + \theta)[2\pi]$, and $X_k(x) = f(x_k)$. Hence the process is a sequence of zeros
and ones, depending on whether $x_k$ is in the last two quadrants or first two
when $x_1$ is picked at random on the circumference. Prove that $X_1, X_2, \ldots$ is
stationary ($[\alpha]$ denotes modulo $\alpha$).


2. MEASURE-PRESERVING TRANSFORMATIONS 


Consider a probability space $(\Omega, \mathcal{F}, P)$ and a transformation $T$ of $\Omega$ into
itself. As usual, we will call $T$ measurable if the inverse images under $T$ of
sets in $\mathcal{F}$ are again in $\mathcal{F}$; that is, if $T^{-1}A = \{\omega;\ T\omega \in A\} \in \mathcal{F}$, all $A \in \mathcal{F}$.


Definition 6.8. A measurable transformation $T$ on $\Omega \to \Omega$ will be called
measure-preserving if $P(T^{-1}A) = P(A)$, all $A \in \mathcal{F}$.


To check whether a given transformation $T$ is measurable, we can easily
generalize 2.28 and conclude that if $T^{-1}C \in \mathcal{F}$, for $C \in \mathcal{C}$, $\mathcal{F}(\mathcal{C}) = \mathcal{F}$, then
$T$ is measurable. Again, to check measure-preserving, both $P(T^{-1}A)$ and
$P(A)$ are $\sigma$-additive probabilities, so we need check only their agreement on
a class $\mathcal{C}$, closed under intersections, such that $\mathcal{F}(\mathcal{C}) = \mathcal{F}$.

Starting from measure-preserving transformations (henceforth assumed
measurable) a large number of stationary processes can be generated. Let $X(\omega)$
be any random variable on $(\Omega, \mathcal{F}, P)$. Let $T$ be measure-preserving and define
a process $X_1, X_2, \ldots$ by $X_1(\omega) = X(\omega)$, $X_2(\omega) = X(T\omega)$, $X_3(\omega) = X(T^2\omega)$, $\ldots$
Another way of looking at this is: If $X_1(\omega)$ is some measurement on the
system at time one, then $X_n(\omega)$ is the same measurement after the system
has evolved $n - 1$ steps so that $\omega \to T^{n-1}\omega$. It should be intuitively clear
that the distribution of the $X_1, X_2, \ldots$ sequence does not depend on origin,
since starting from any $X_n(\omega)$ we get $X_{n+1}(\omega)$ as $X_n(T\omega)$, and so forth. To
make this firm, denote by $T^0$ the identity operator, and we prove

Proposition 6.9. Let $T$ be measure-preserving on $(\Omega, \mathcal{F}, P)$, $X$ a random
variable on $(\Omega, \mathcal{F})$; then the sequence $X_n(\omega) = X(T^{n-1}\omega)$, $n = 1, 2, \ldots$ is a
stationary sequence of random variables.


Proof. First of all, $X_n(\omega)$ is a random variable, because $\{X_n(\omega) \in B\} =
\{X(T^{n-1}\omega) \in B\}$. Let $A = \{X \in B\}$. Then

$$\{X(T^{n-1}\omega) \in B\} = \{\omega;\ T^{n-1}\omega \in A\}.$$

Evidently, however, $T$ measurable implies $T^{n-1}$ measurable, or $T^{-n+1}A \in \mathcal{F}$.
Now let $A = \{\omega;\ (X_1, X_2, \ldots) \in B\}$, $B \in \mathcal{B}_\infty$; thus

$$A = \{\omega;\ (X(\omega), X(T\omega), \ldots) \in B\}.$$

Look at $A_1 = \{\omega;\ (X_2, X_3, \ldots) \in B\}$. This similarly is the set
$\{\omega;\ (X(T\omega), X(T^2\omega), \ldots) \in B\}$. Hence $\omega \in A_1 \iff T\omega \in A$ or $A_1 = T^{-1}A$.
But, by hypothesis, $P(T^{-1}A) = P(A)$.

Can every stationary process be generated by a measure-preserving
transformation? Almost! In terms of distribution, the answer is Yes.
Starting from any stationary process $X_1(\omega), X_2(\omega), \ldots$ go to the coordinate
representation process $\hat{X}_1, \hat{X}_2, \ldots$ on $(R^{(\infty)}, \mathcal{B}_\infty, \hat{P})$. By definition,

$$\hat{X}_n(\mathbf{x}) = x_n.$$


Definition 6.10. On $(R^{(\infty)}, \mathcal{B}_\infty)$ define the shift transformation $S: R^{(\infty)} \to R^{(\infty)}$
by $S(x_1, x_2, \ldots) = (x_2, x_3, \ldots)$.


So, for example, $S(3, 2, 7, 1, \ldots) = (2, 7, 1, \ldots)$.

The point is that from the definitions, $\hat{X}_n(\mathbf{x}) = \hat{X}_1(S^{n-1}\mathbf{x})$. We prove
below that $S$ is measurable and measure-preserving, thus justifying the
answer of “Almost” above.


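Proposition 6.11. The transformation $S$ defined above is measurable, and if
$X_1, X_2, \ldots$ is stationary, then $S$ preserves $\hat{P}$ measure.


Proof. To show $S$ measurable, consider

$$B = S^{-1}\{\mathbf{x};\ x_1 \in I_1, \ldots, x_n \in I_n\} = S^{-1}C.$$

By definition, letting $(S\mathbf{x})_k$ be the $k$th coordinate of $S\mathbf{x}$, we find that

$$B = \{\mathbf{x};\ (S\mathbf{x})_1 \in I_1, \ldots, (S\mathbf{x})_n \in I_n\} = \{\mathbf{x};\ x_2 \in I_1, \ldots, x_{n+1} \in I_n\},$$

and that $B$ is therefore obviously in $\mathcal{B}_\infty$. Furthermore, by the stationarity of
$X_1, X_2, \ldots$,

$$\hat{P}(B) = \hat{P}(S^{-1}C) = P(X_2 \in I_1, \ldots, X_{n+1} \in I_n) = P(X_1 \in I_1, \ldots, X_n \in I_n) = \hat{P}(C).$$

So $S$ is also measure-preserving.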




Problems 


2. Show that the following transformations are measurable and measure- 
preserving. 


1) $\Omega = [0, 1)$, $\mathcal{F} = \mathcal{B}[0, 1)$, $P = dx$. Let $\lambda$ be any number in $[0, 1)$ and
define $Tx = (x + \lambda)[1]$.

2) $\Omega = [0, 1)$, $\mathcal{F} = \mathcal{B}[0, 1)$, $P = dx$, $Tx = (2x)[1]$.

3) Same as (2) above, but $Tx = (kx)[1]$, where $k > 2$ is integral.


3. Show that for the following transformations on $[0, 1)$, $\mathcal{B}[0, 1)$, there is
no $P$ such that $P(\text{single point}) = 0$, and the transformation preserves $P$.

1) $Tx = \lambda x$, $0 < \lambda < 1$.
2) $Tx = x^2$.


4. On $\Omega = [0, 1)$, define $T: \Omega \to \Omega$ by $Tx = (2x)[1]$. Use $\mathcal{F} = \mathcal{B}([0, 1))$,
$P = dx$. Define

$$X(x) = \begin{cases} 0, & 0 \le x < \frac{1}{2}, \\ 1, & \frac{1}{2} \le x < 1. \end{cases}$$

Show that the sequence $X_n(x) = X(T^{n-1}x)$ consists of independent zeros
and ones with probability $\frac{1}{2}$ each.

Show that corresponding to every stationary sequence $X_1(\omega), X_2(\omega), \ldots$
such that $X_n(\omega) \in \{0, 1\}$, there is a probability $Q(dx)$ on $\mathcal{B}[0, 1)$ such that
$Tx = (2x)[1]$ preserves $Q$-measure, and such that the $X_n(x)$ sequence defined
above has the same distribution with respect to $\mathcal{B}[0, 1)$, $Q(dx)$ as $X_1, X_2, \ldots$
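To see what the doubling transformation of Problem 4 does to a point, it helps to run it on exact rationals. In the sketch below (the function names are ours), $X_n(x) = X(T^{n-1}x)$ is read off along the orbit of $x$ under $Tx = (2x)[1]$, and the output is the start of the binary expansion of $x$.

```python
from fractions import Fraction
from typing import List


def doubling_map(x: Fraction) -> Fraction:
    """T x = (2x) modulo 1 on [0, 1)."""
    return (2 * x) % 1


def X(x: Fraction) -> int:
    """The observable of Problem 4: 0 on [0, 1/2), 1 on [1/2, 1)."""
    return 0 if x < Fraction(1, 2) else 1


def digits(x: Fraction, n: int) -> List[int]:
    """X_1(x), ..., X_n(x), where X_k(x) = X(T^(k-1) x)."""
    out = []
    for _ in range(n):
        out.append(X(x))
        x = doubling_map(x)
    return out


print(digits(Fraction(5, 8), 5))  # [1, 0, 1, 0, 0], and 5/8 = 0.101 in binary

# Each of the 16 patterns of four digits occurs for exactly one x = j/16,
# that is, on exactly one dyadic interval of length 1/16.
patterns = {tuple(digits(Fraction(j, 16), 4)) for j in range(16)}
print(len(patterns))  # 16
```

That every pattern of $n$ digits corresponds to one dyadic interval of length $2^{-n}$ is the observation that the first part of the problem asks to be turned into a proof of independence.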


3. INVARIANT SETS AND ERGODICITY 
Let $T$ be a measure-preserving transformation on $(\Omega, \mathcal{F}, P)$.

Definition 6.12. A set $A \in \mathcal{F}$ is invariant if $T^{-1}A = A$.


If $A$ is an invariant set, then the motion $T$ of $\Omega \to \Omega$ carries $A$ into $A$; that
is, if $\omega \in A$, then $T\omega \in A$ (because $T^{-1}A^c = A^c$). $A^c$ is also invariant, and
for all $n$, $T^n$ carries points of $A$ into $A$ and points of $A^c$ into $A^c$. Because of the
properties of inverse mappings we have


Proposition 6.13. The class of invariant sets is a $\sigma$-field $\mathcal{I}$.

Proof. Just write down definitions.


In the study of dynamical systems, $\Omega$ is the phase space of the system,
and if $\omega$ is the state of the system at $t = 0$, then its state at time $t$ is given by
$T_t\omega$, where $T_t: \Omega \to \Omega$ is the motion of the phase space into itself induced by
the equations of motion. For a conservative system $T_t(T_\tau\omega) = T_{t+\tau}\omega$.
We discretize time and take $T = T_1$, so the state of the system at time $n$
is given by $T^n\omega$. Suppose that $X(\omega)$ is some observable function of the state
$\omega$. Physically, in taking measurements, the observation time is quite long
compared to some natural time scale of molecular interactions. So we
measure, not $X(\omega)$, but the average of $X(\omega)$ over the different states into which
$\omega$ passes with the evolution of the system. That is, we measure

$$\frac{1}{\tau} \int_0^{\tau} X(T_t\omega) \, dt,$$

for $\tau$ large, or in discrete time

$$\frac{1}{n} \sum_{k=0}^{n-1} X(T^k\omega),$$

for $n$ large. The brilliant insight of Gibbs was the following argument:
that in time, the point $\omega_t = T_t\omega$ wandered all over the phase space and
that the density of the points $\omega_t$ in any neighborhood tended toward a
limiting distribution. Intuitively, this limiting distribution of points had to be
invariant under $T$. If there is such a limiting distribution, say a measure $P$,
then we should be able to replace the limiting time average

$$\lim_n \frac{1}{n} \sum_{k=0}^{n-1} X(T^k\omega)$$

by the phase average

$$\int X(\omega) P(d\omega).$$

Birkhoff’s result was that this argument, properly formulated, was true!
To put Gibbs’s conjecture in a natural setting, take $\Omega$ to be all points on a
surface of constant energy. (This will be a subset of $R^{(6n)}$ where $n$ is the
number of particles.) Take $\mathcal{F}$ to be the intersection of $\mathcal{B}_{6n}$ with $\Omega$ and $P$ the
normalized surface area on $\Omega$. By Liouville’s theorem, $T: \Omega \to \Omega$ preserves
$P$-measure. The point is now that $T^n\omega$ will never become distributed over $\Omega$
in accordance with $P$ if there are invariant subsets $A$ of $\Omega$ such that $P(A) > 0$,
$P(A^c) > 0$, because in this case the points of $A$ will remain in $A$; similarly,
for $A^c$. Thus, the only hope for Gibbs’s conjecture is that every invariant set
$A$ has probability zero or one.
To properly formulate it, begin with


Definition 6.14. Let $T$ be measure-preserving on $(\Omega, \mathcal{F}, P)$. $T$ is called ergodic
if for every $A \in \mathcal{I}$, $P(A) = 0$ or $1$.


One question that is relevant here is: Suppose one defined events $A$ to
be a.s. invariant if $P(A \,\Delta\, T^{-1}A) = 0$. Is the class of a.s. invariant events
considerably different from the class of invariant events? Not so! It is
exactly the completion of $\mathcal{I}$ with respect to $P$ and $\mathcal{F}$.


Figure 6.1


Proposition 6.15. Let $A \in \mathcal{F}$ be a.s. invariant; then there is a set $A' \in \mathcal{F}$ which
is invariant such that $P(A \,\Delta\, A') = 0$.


Proof. Let

$$A'' = \bigcup_{n=0}^{\infty} T^{-n}A, \qquad T^0 A = A.$$

Then $A'' = A$ a.s., and $T^{-1}A'' \subset A''$. Let

$$A' = \bigcap_{n=0}^{\infty} T^{-n}A'';$$

noting that $A''$ is a.s. invariant gives $A' = A$ a.s., and $T^{-1}A'' \subset A''$ implies
that $A'$ is invariant.

The concept of ergodicity, as pointed out above, is a guarantee that the 
phase space does not split into parts of positive probability which are 
inaccessible to one another. 


Example. Let $\Omega$ be the unit square, that is, $\Omega = \{(x, y);\ 0 \le x < 1,
0 \le y < 1\}$, $\mathcal{F}$ the Borel field. Let $a$ be any positive number and define
$T(x, y) = ((x + a)[1], (y + a)[1])$. Use $P = dx\,dy$; then it is easy to check that
$T$ is measure-preserving. What we have done is to sew edges of $\Omega$ together
(see Fig. 6.1) so that $\alpha$ and $\alpha'$ are together, $\beta$ and $\beta'$. $T$ moves points at a
45° angle along the sewn-together square. Just by looking at Fig. 6.1 you can
see that $T$ does not move around the points of $\Omega$ very much, and it is easy to
construct invariant sets of any probability, for example, the shaded set as
shown in Fig. 6.2.

Problems


5. In Problem 2(1), show that if $\lambda$ is rational, $T$ is not ergodic.


6. We use this problem to illustrate more fully the dynamical aspect. Take
$(\Omega, \mathcal{F})$ and let $T: \Omega \to \Omega$ be measurable. Start with any point $\omega \in \Omega$; then
$\omega$ has the motion $\omega \to T\omega \to T^2\omega \to \cdots$ Let $N_n(A, \omega)$ be the number of
times that the moving point $T^k\omega$ enters the set $A \in \mathcal{F}$ during the first $n$
motions; that is, $N_n(A, \omega)$ is the number of times that $T^k\omega \in A$, $k =
0, 1, \ldots, n - 1$. Keeping $\omega$ fixed, define probabilities $P_\omega^{(n)}(\cdot)$ on $\mathcal{F}$ by
$P_\omega^{(n)}(\cdot) = N_n(\cdot, \omega)/n$. That is, $P_\omega^{(n)}(A)$ is the proportion of times in the first
$n$ moves that the point is in the set $A$. Let $X$ be any random variable on
$(\Omega, \mathcal{F})$.

a) Show that

$$\frac{1}{n} \sum_{k=0}^{n-1} X(T^k\omega) = \int X(\omega') P_\omega^{(n)}(d\omega').$$

Assume that there is a probability $P_\omega(\cdot)$ on $\mathcal{F}$ such that for every $A \in \mathcal{F}$,
$\lim_n P_\omega^{(n)}(A) = P_\omega(A)$.

b) Show that $T$ is $P_\omega(\cdot)$-preserving, that is,

$$P_\omega(T^{-1}A) = P_\omega(A), \quad \text{all } A \in \mathcal{F}.$$

c) Show that if $X(\omega) \ge 0$ is bounded, then

$$\frac{1}{n} \sum_{k=0}^{n-1} X(T^k\omega) \to \int X(\omega') \, dP_\omega(\omega').$$

What is essential here is that the limit $\int X \, dP_\omega$ not depend on where the system
started at time zero. Otherwise, to determine the limiting time averages of
(c) for the system, a detailed knowledge of the position $\omega$ in phase-space at
$t = 0$ would be necessary. Hence, what we really need is the additional
assumption that $P_\omega(\cdot)$ be the same for all $\omega$, in other words, that the limiting
proportion of time that is spent in the set $A$ not depend on the starting posi-
tion $\omega$. Now substitute this stronger assumption, that is: There is a proba-
bility $P(\cdot)$ on $\mathcal{F}$ such that for every $A \in \mathcal{F}$ and $\omega \in \Omega$,

$$\lim_n P_\omega^{(n)}(A) = P(A).$$

d) Show that under the above assumption, $A$ invariant $\Rightarrow A = \Omega$ or $\emptyset$.


This result shows not only that $T$ is ergodic on $(\Omega, \mathcal{F}, P)$, but ergodic
in a much stronger sense than that of Definition 6.14. The stronger assumption
above is much too restrictive in the sense that most dynamical systems do
not satisfy it. There are usually some starting states $\omega$ which are exceptional
in that the motion under $T$ does not mix them up very well. Take, for example,
elastic two-dimensional molecules in a rectangular box. At $t = 0$ consider
the state $\omega$ which is shown in Fig. 6.3, where all the molecules have the same
$x$-coordinate and velocity. Obviously, there will be large chunks of phase-
space that will never be entered if this is the starting state. What we want then,
is some weaker version that says $P_\omega^{(n)}(\cdot) \to P(\cdot)$ for most starting states $\omega$.
With this weakening, the strong result of (d) above will no longer be true.
We come back later to the appropriate weakening which, of course, will
result in something like 6.14 instead of (d).


Figure 6.3 


4. INVARIANT RANDOM VARIABLES 
Along with invariant sets go invariant random variables. 


Definition 6.16. Let $X(\omega)$ be a random variable on $(\Omega, \mathcal{F}, P)$, $T$ measure-
preserving; then $X(\omega)$ is called an invariant random variable if $X(\omega) = X(T\omega)$.


Note

Proposition 6.17. $X$ is invariant iff $X$ is measurable $\mathcal{I}$.


Proof. If $X$ is invariant, then for every $x$, $\{X < x\} \in \mathcal{I}$; hence $X$ is measur-
able $\mathcal{I}$. Conversely, if $X(\omega) = \chi_A(\omega)$, $A \in \mathcal{I}$, then

$$X(T\omega) = \chi_A(T\omega) = \chi_{T^{-1}A}(\omega) = \chi_A(\omega).$$

Now consider the class $\mathcal{L}$ of all random variables on $(\Omega, \mathcal{I})$ which are invariant;
clearly $\mathcal{L}$ is closed under linear combinations, and $X_n \in \mathcal{L}$, $X_n(\omega) \uparrow X(\omega)$
implies

$$X(T\omega) = \lim X_n(T\omega) = \lim X_n(\omega) = X(\omega).$$

Hence by 2.38, $\mathcal{L}$ contains all nonnegative random variables on $(\Omega, \mathcal{I})$.
Thus clearly every random variable on $(\Omega, \mathcal{I})$ is invariant.

The condition for ergodicity can be put very nicely in terms of invariant 
random variables. 


Proposition 6.18. Let $T$ be measure-preserving on $(\Omega, \mathcal{F}, P)$. $T$ is ergodic iff
every invariant random variable $X(\omega)$ is a.s. equal to a constant.


Proof. One way is immediate; that is, for any invariant set $A$, let $X(\omega) =
\chi_A(\omega)$. Then $X(\omega)$ constant a.s. implies $P(A) = 0, 1$. Conversely, suppose
$P(X < x) = 0, 1$ for all $x$. Since for $x \uparrow +\infty$, $P(X < x) \to 1$, $P(X < x) = 1$
for all $x$ sufficiently large. Let $x_0 = \inf \{x;\ P(X < x) = 1\}$. Then for
every $\epsilon > 0$, $P(x_0 - \epsilon < X < x_0 + \epsilon) = 1$, and taking $\epsilon \downarrow 0$ yields
$P(X = x_0) = 1$.

Obviously we can weaken 6.18 to read 


Proposition 6.19. T is ergodic iff every bounded invariant random variable is
a.s. constant.


In general, it is usually difficult to show that a given transformation is 
ergodic. Various tricks are used: For example, we can apply 6.19 to Problem 
2(1). 


Example. We show that if $\lambda$ is irrational, then $Tx = (x + \lambda)[1]$ is ergodic.
Let f(x) be any Borel-measurable function on [0, 1). Assume it is in $L_2(dx)$,
that is, $\int f^2 \, dx < \infty$. Then we have

\[ f(x) = \sum_{n=-\infty}^{+\infty} c_n e^{2\pi i n x}, \quad \text{a.s. } (dx), \]

where the sum exists as a limit in the second mean, and $\sum |c_n|^2 < \infty$.
Therefore

\[ f(Tx) = \sum_{n=-\infty}^{+\infty} c_n e^{2\pi i n \lambda} e^{2\pi i n x}. \]

For f(x) to be invariant, $c_n (1 - e^{2\pi i n \lambda}) = 0$ for every n. This implies either $c_n = 0$
or $e^{2\pi i n \lambda} = 1$. The latter can never be satisfied for nonzero n and irrational
$\lambda$. The conclusion is that $f(x) = c_0$ a.s.; by 6.19, T is ergodic.


Problems 


7. Use the method of the above example to show that the transformation of 
Problem 2(2) is ergodic. 


8. Using 2.38, show that if T is measure-preserving on $(\Omega, \mathcal{F}, P)$ and $X(\omega)$
is any random variable, then

\[ (6.20) \qquad E[X(\omega)] = E[X(T\omega)]. \]


5. THE ERGODIC THEOREM 


One of the most remarkable of the strong limit theorems is the result usually 
referred to as the ergodic theorem. 


Theorem 6.21. Let T be measure-preserving on $(\Omega, \mathcal{F}, P)$. Then for X any
random variable such that $E|X| < \infty$,

\[ \lim_n \frac{1}{n} \sum_{k=0}^{n-1} X(T^k \omega) = E(X \mid \mathcal{I}) \quad \text{a.s.} \]

To prove this result, we prove first an odd integration inequality:


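Theorem 6.22 (Maximal ergodic theorem). Let T be measure-preserving on
$(\Omega, \mathcal{F}, P)$, and X a random variable such that $E|X| < \infty$. Define

\[ S_k(\omega) = X(\omega) + \cdots + X(T^{k-1}\omega), \quad \text{and} \quad M_n(\omega) = \max(0, S_1, S_2, \ldots, S_n). \]

Then

\[ \int_{\{M_n > 0\}} X \, dP \ge 0. \]


Proof. We give a very simple recent proof of this due to Adriano Garsia
[61]. For any $k \le n$, $M_n(T\omega) \ge S_k(T\omega)$. Hence

\[ X(\omega) + M_n(T\omega) \ge X(\omega) + S_k(T\omega) = S_{k+1}(\omega). \]

Write this as

\[ X(\omega) \ge S_{k+1}(\omega) - M_n(T\omega), \quad k = 1, \ldots, n. \]

But trivially,

\[ X(\omega) \ge S_1(\omega) - M_n(T\omega), \]

since $S_1(\omega) = X(\omega)$ and $M_n(T\omega) \ge 0$. These two inequalities together give
$X(\omega) \ge \max(S_1(\omega), \ldots, S_n(\omega)) - M_n(T\omega)$. Thus

\[ \int_{\{M_n > 0\}} X \, dP \ge \int_{\{M_n > 0\}} [\max(S_1(\omega), \ldots, S_n(\omega)) - M_n(T\omega)] \, dP. \]

On the set $\{M_n > 0\}$, $\max(S_1, \ldots, S_n) = M_n$. Hence

\[ \int_{\{M_n > 0\}} X \, dP \ge \int_{\{M_n > 0\}} [M_n(\omega) - M_n(T\omega)] \, dP, \]

but, since $M_n(\omega) = 0$ off $\{M_n > 0\}$ and $M_n(T\omega) \ge 0$ everywhere,

\[ \int_{\{M_n > 0\}} M_n(\omega) \, dP - \int_{\{M_n > 0\}} M_n(T\omega) \, dP \ge \int M_n(\omega) \, dP - \int M_n(T\omega) \, dP = 0. \]

This last is by (6.20).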
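
The inequality of 6.22 can be checked exactly on a small example. The sketch below is an illustration only, not part of the proof: it assumes the finite space {0, ..., N-1} with the uniform probability and the cyclic shift $T\omega = (\omega + 1) \bmod N$, which is measure-preserving, and the function name `maximal_ergodic_sum` is hypothetical. For randomly chosen X and n it evaluates the integral of X over $\{M_n > 0\}$.

```python
import random

def maximal_ergodic_sum(x, n):
    """Integral of X over {M_n > 0} for the cyclic shift T(w) = (w + 1) mod N
    on {0, ..., N-1} with the uniform probability (an assumed toy setting)."""
    N = len(x)
    total = 0.0
    for w in range(N):
        s = 0.0   # running partial sum S_k(w)
        m = 0.0   # running maximum max(0, S_1(w), ..., S_k(w))
        for k in range(n):
            s += x[(w + k) % N]   # adds X(T^k w)
            m = max(m, s)
        if m > 0:                 # w belongs to {M_n > 0}
            total += x[w]
    return total / N

rng = random.Random(0)
results = []
for trial in range(200):
    N = rng.randint(1, 12)
    n = rng.randint(1, 15)
    x = [rng.uniform(-1.0, 1.0) for _ in range(N)]
    results.append(maximal_ergodic_sum(x, n))
worst = min(results)   # by 6.22 this should not be negative (up to rounding)
```

Every trial should give a nonnegative value, whatever the signs of the individual X values; only rounding error could push a result slightly below zero.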
Completion of proof of 6.21. Assuming that $E(X \mid \mathcal{I}) = 0$, prove that the
averages converge to zero a.s. Then apply this result to the random variable
$X(\omega) - E(X \mid \mathcal{I})$ to get the general case. Let $\bar{X} = \limsup_n S_n/n$, and for any
$\epsilon > 0$, denote $D = \{\bar{X} > \epsilon\}$. Note that $\bar{X}(T\omega) = \bar{X}(\omega)$, so $\bar{X}$ and therefore
D are invariant. Define the random variable

\[ X^*(\omega) = (X(\omega) - \epsilon)\chi_D(\omega), \]

and using $X^*$, define $S_k^*$, $M_n^*$ as above.
The maximal ergodic theorem gives

\[ \int_{\{M_n^* > 0\}} X^* \, dP \ge 0. \]

The rest of the proof is easy sailing. The sets

\[ F_n = \{M_n^* > 0\} = \Big\{ \max_{1 \le k \le n} S_k^* > 0 \Big\} \]

converge upward to the set

\[ F = \Big\{ \sup_{k \ge 1} S_k^* > 0 \Big\} = \Big\{ \sup_{k \ge 1} \frac{S_k^*}{k} > 0 \Big\} = \Big\{ \sup_{k \ge 1} \frac{S_k}{k} > \epsilon \Big\} \cap D. \]

Since $\sup_{k \ge 1} S_k/k \ge \bar{X}$, $F = D$. The inequality $E|X^*| \le E|X| + \epsilon$ allows
the use of the bounded convergence theorem, so we conclude that

\[ \int_{F_n} X^* \, dP \to \int_F X^* \, dP. \]

Therefore,

\[ \int_D X^* \, dP \ge 0. \]

But

\[ \int_D X^* \, dP = \int_D X \, dP - \epsilon P(D) = \int_D E(X \mid \mathcal{I}) \, dP - \epsilon P(D) = -\epsilon P(D), \]

which implies $P(D) = 0$, and $\bar{X} \le 0$ a.s. Apply the same argument to the
random variable $-X(\omega)$. Here the lim sup of the sums is

\[ \limsup_n \Big( -\frac{S_n}{n} \Big) = -\liminf_n \frac{S_n}{n} = -\underline{X}. \]

The conclusion above becomes $-\underline{X} \le 0$ or $\underline{X} \ge 0$ a.s. Putting these two
together gives the theorem. Q.E.D.
A consequence of 6.21 is that if T is ergodic, time averages can be
replaced by phase averages, in other words,


Corollary 6.23. Let T be measure-preserving and ergodic on $(\Omega, \mathcal{F}, P)$. Then
for X any random variable such that $E|X| < \infty$,

\[ \lim_n \frac{1}{n} \sum_{k=0}^{n-1} X(T^k \omega) = EX \quad \text{a.s.} \]


Proof. Every set in $\mathcal{I}$ has probability zero or one, hence
$E(X \mid \mathcal{I}) = EX$ a.s.
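
As a numerical illustration of 6.23, consider the rotation $Tx = (x + \lambda)[1]$ of the earlier example, which is ergodic for irrational $\lambda$. The sketch below assumes the particular choices $\lambda = \sqrt{2} - 1$, $X(x) = x^2$, and starting point 0.1; these values and the function name `time_average` are hypothetical, picked only for the demonstration. It compares the time average along one orbit with the phase average $\int_0^1 x^2 \, dx = 1/3$.

```python
import math

def time_average(f, x0, lam, n):
    """Average of f over the orbit x0, T x0, ..., T^(n-1) x0 of T x = (x + lam) mod 1."""
    total = 0.0
    for k in range(n):
        total += f((x0 + k * lam) % 1.0)
    return total / n

lam = math.sqrt(2.0) - 1.0     # an irrational rotation number (assumed choice)
avg = time_average(lambda x: x * x, 0.1, lam, 100_000)
phase_avg = 1.0 / 3.0          # integral of x^2 over [0, 1)
```

The corollary only promises agreement for almost every starting point; a floating-point run is a plausibility check of that statement, not a verification of it.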
6. CONVERSES AND COROLLARIES


It is natural to ask whether the conditions of the ergodic theorem are
necessary and sufficient. Again the answer is: almost. If X is a nonnegative
random variable and $EX = \infty$, it is easy to show that for T measure-preserving
and ergodic,

\[ \frac{1}{n} \sum_{k=0}^{n-1} X(T^k \omega) \to \infty \quad \text{a.s.} \]

Because defining for $\alpha > 0$,

\[ X_\alpha(\omega) = \begin{cases} X(\omega), & \text{if } X(\omega) \le \alpha, \\ 0, & \text{otherwise,} \end{cases} \]

of course, $E|X_\alpha| < \infty$. Thus the ergodic theorem can be used to get

\[ \liminf_n \frac{1}{n} \sum_{k=0}^{n-1} X(T^k \omega) \ge \lim_n \frac{1}{n} \sum_{k=0}^{n-1} X_\alpha(T^k \omega) = EX_\alpha \quad \text{a.s.} \]

Take $\alpha \uparrow \infty$ to get the conclusion.

But, in general, if $E|X| = \infty$, it does not follow that the averages diverge
a.s. (see Halmos [65, p. 32]).

Come back now to the question of the asymptotic density of the points
$\omega, T\omega, T^2\omega, \ldots$ In the ergodic theorem, for any $A \in \mathcal{F}$, take $X(\omega) = \chi_A(\omega)$.
Then the conclusion reads, if T is ergodic,

\[ \frac{1}{n} \sum_{k=0}^{n-1} \chi_A(T^k \omega) \to P(A) \quad \text{a.s.,} \]

so that for almost every starting point $\omega$, the asymptotic proportion of
points in A is exactly P(A). If $\Omega$ has a topology with a countable basis such
that $P(N) > 0$ for every open neighborhood N, then this implies that for
almost every $\omega$, the set of points $\omega, T\omega, T^2\omega, \ldots$ is dense in $\Omega$.

Another interesting and curious result is 


Corollary 6.24. Let $T: \Omega \to \Omega$ be measure-preserving and ergodic with respect
to both $(\Omega, \mathcal{F}, P_1)$ and $(\Omega, \mathcal{F}, P_2)$. Then either $P_1 = P_2$ or $P_1$ and $P_2$ are
orthogonal in the sense that there is a set $A \in \mathcal{I}$ such that $P_1(A) = 1$, $P_2(A^c) = 1$.


Proof. If $P_1 \ne P_2$, take $B \in \mathcal{F}$ such that $P_1(B) \ne P_2(B)$ and let $X(\omega) = \chi_B(\omega)$.
Let A be the set of $\omega$ such that

\[ \frac{1}{n} \sum_{k=0}^{n-1} X(T^k \omega) \to P_1(B). \]

By the ergodic theorem $P_1(A) = 1$. But $A^c$ includes all $\omega$ such that

\[ \frac{1}{n} \sum_{k=0}^{n-1} X(T^k \omega) \to P_2(B), \]

and we see that $P_2(A^c) = 1$.
Finally, we ask concerning convergence in the first mean of the averages
to EX. By the Lebesgue theorem we know that a.s. convergence plus just
a little more gives first mean convergence. But here we have to work a bit
to get the additional piece.


Corollary 6.25. Under the conditions of the ergodic theorem

\[ \lim_n E \Big| \frac{1}{n} \sum_{k=0}^{n-1} X(T^k \omega) - E(X \mid \mathcal{I}) \Big| = 0. \]

Proof. We can assume that $E(X \mid \mathcal{I}) = 0$. Let

\[ V_n = \frac{1}{n} \sum_{k=0}^{n-1} X(T^k \omega). \]

Since $V_n \to 0$ a.s., by Egoroff's theorem, for any $\epsilon > 0$, $\exists A \in \mathcal{F}$ such that
$P(A) < \epsilon$ and $V_n \to 0$ uniformly on $A^c$. Now,

\[ \limsup_n E|V_n| \le \limsup_n \int_A |V_n| \, dP \le \limsup_n \frac{1}{n} \sum_{k=0}^{n-1} \int_A |X_k| \, dP, \quad X_k = X(T^k \omega). \]

These integrals can be estimated by

\[ \int_A |X_k| \, dP = \int_{A \cap \{|X_k| > N\}} |X_k| \, dP + \int_{A \cap \{|X_k| \le N\}} |X_k| \, dP \]

\[ \le \int_{A \cap \{|X_k| > N\}} |X_k| \, dP + N P(A) \le \int_{\{|X| > N\}} |X| \, dP + N P(A). \]

Since $\epsilon$ is arbitrary, conclude that for any N,

\[ \limsup_n E|V_n| \le \int_{\{|X| > N\}} |X| \, dP. \]

Let N go to infinity; then by the bounded convergence theorem, the right-hand
side above goes to zero.


Problems 


9. Another consequence of the ergodic theorem is a weak form of Weyl's
equidistribution theorem. For any x in [0, 1) and interval $I \subset [0, 1)$, let
$R_x^{(n)}(I)$ be the proportion of the points $\{(x + \lambda)[1], (x + 2\lambda)[1], \ldots, (x + n\lambda)[1]\}$
falling in the interval I. If $\lambda > 0$ is irrational, show that for x
in a set of Lebesgue measure one, $R_x^{(n)}(I) \to$ length I.

10. Let $\mu$ be a finite measure on $\mathcal{B}_1([0, 1))$ such that $Tx = 2x[1]$ preserves
$\mu$-measure. Show that $\mu$ is singular with respect to Lebesgue measure.

11. Let T be measurable on $(\Omega, \mathcal{F})$, and define $\mathcal{M}$ as the set of all probabilities
P on $\mathcal{F}$ such that T is measure-preserving on $(\Omega, \mathcal{F}, P)$. Define real linear
combinations by $(\alpha P_1 + \beta P_2)(B) = \alpha P_1(B) + \beta P_2(B)$, $B \in \mathcal{F}$. Show that

a) $\mathcal{M}$ is convex, that is, for $\alpha, \beta \ge 0$, $\alpha + \beta = 1$,
$P_1, P_2 \in \mathcal{M} \Rightarrow \alpha P_1 + \beta P_2 \in \mathcal{M}$.

An extreme point of $\mathcal{M}$ is a probability $P \in \mathcal{M}$ which is not a linear combination
$\alpha P_1 + \beta P_2$, $\alpha, \beta > 0$, $\alpha + \beta = 1$, with distinct $P_1, P_2 \in \mathcal{M}$. Show that

b) the extreme points of $\mathcal{M}$ are the probabilities P such that T is ergodic on
$(\Omega, \mathcal{F}, P)$.

7. BACK TO STATIONARY PROCESSES 


By the ergodic theorem and its corollary, if the shift-transformation S (see
6.10) is ergodic on $(R^{(\infty)}, \mathcal{B}_\infty, \hat{P})$, then

\[ \frac{1}{n} \sum_{k=1}^{n} \hat{X}_k \to E\hat{X}_1 \quad \text{a.s. and first mean.} \]

If S is ergodic, then, the same conclusions will hold for the original $X_1, X_2, \ldots$
process, because a.s. convergence and rth mean convergence depend
only on the distribution of the process.

Almost all the material concerning invariance and ergodicity can be
formulated in terms of the original process $X_1, X_2, \ldots$ rather than going into
representation space. If $B \in \mathcal{B}_\infty$ and $A = \{X \in B\}$, then the inverse image
under X of $S^{-1}B$, S the shift operator, is

\[ \{X \in S^{-1}B\} = \{\omega;\ (X_2, X_3, \ldots) \in B\}. \]

Hence, we reach


Definition 6.26. An event $A \in \mathcal{F}$ is invariant if $\exists B \in \mathcal{B}_\infty$ such that for every
$n \ge 1$

\[ A = \{(X_n, X_{n+1}, \ldots) \in B\}. \]


The class of invariant events is easily seen to be a $\sigma$-field. Similarly, we define
a random variable Z to be invariant if there is a random variable $\varphi$ on
$(R^{(\infty)}, \mathcal{B}_\infty)$ such that

\[ (6.27) \qquad Z = \varphi(X_n, X_{n+1}, \ldots), \quad \text{all } n \ge 1. \]


The results of Section 4 hold again; Z is invariant iff it is $\mathcal{I}$-measurable.
The ergodic theorem translates as


Theorem 6.28. If $X_1, X_2, \ldots$ is a stationary process, $\mathcal{I}$ the $\sigma$-field of invariant
events, and $E|X_1| < \infty$, then

\[ \frac{1}{n} \sum_{k=1}^{n} X_k \to E(X_1 \mid \mathcal{I}) \quad \text{a.s.} \]
1 
Proof. From the ergodic theorem, $S_n/n$ converges a.s. to some random variable
Y. It is not difficult to give an argument that the correct translation is
$Y = E(X_1 \mid \mathcal{I})$. But we can identify Y directly: take $Y = \limsup_n S_n/n$, then the
sets $\{Y < y\}$ are invariant, hence Y is $\mathcal{I}$-measurable. Take $A \in \mathcal{I}$, then since
we have first mean convergence,

\[ (6.29) \qquad \frac{1}{n} \sum_{k=1}^{n} \int_A X_k \, dP \to \int_A Y \, dP. \]

Select $B \in \mathcal{B}_\infty$ so that $A = \{(X_k, \ldots) \in B\}$, for all $k \ge 1$. Now stationarity
gives

\[ \int_A X_k \, dP = \int_{\{(X_k, \ldots) \in B\}} X_k \, dP = \int_{\{(X_1, \ldots) \in B\}} X_1 \, dP = \int_A X_1 \, dP. \]

Use this in 6.29, to conclude

\[ \int_A X_1 \, dP = \int_A Y \, dP, \quad \text{all } A \in \mathcal{I}. \]

By definition, $Y = E(X_1 \mid \mathcal{I})$.


Definition 6.30. A stationary process $X_1, X_2, \ldots$ is ergodic if every invariant
event has probability zero or one.


If $\mathcal{I}$ has this zero-one property, of course the averages converge to $EX_1$ a.s.
Ergodicity, like stationarity, is preserved under taking functions of the
process. More precisely,


Proposition 6.31. Let $X_1, X_2, \ldots$ be a stationary and ergodic process, $\varphi(\mathbf{x})$
measurable $\mathcal{B}_\infty$, then the process $Y_1, Y_2, \ldots$ defined by

\[ Y_k = \varphi(X_k, X_{k+1}, \ldots) \]

is ergodic.


Proof. This is very easy to see. Use the same argument as in Proposition
6.6 to conclude that for any $B \in \mathcal{B}_\infty$, $\exists A \in \mathcal{B}_\infty$ such that

\[ \{\omega;\ (Y_1, Y_2, \ldots) \in B\} = \{\omega;\ (X_1, X_2, \ldots) \in A\}, \]

\[ \{\omega;\ (Y_2, Y_3, \ldots) \in B\} = \{\omega;\ (X_2, X_3, \ldots) \in A\}, \]

and so on. Hence, every invariant event on the Y-process coincides with an invariant
event on the X-process.


One result that is both interesting and useful in establishing ergodicity is 


Proposition 6.32. Let $X_1, X_2, \ldots$ be a stationary process. Then every invariant
event A is a tail event.


Proof. Take B so that $A = \{(X_n, X_{n+1}, \ldots) \in B\}$, $n \ge 1$. Hence
$A \in \mathcal{F}(X_n, X_{n+1}, \ldots)$, all n.


Corollary 6.33. Let $X_1, X_2, \ldots$ be independent and identically distributed;
then the process is ergodic.


Proof. Kolmogorov's zero-one law.


By this corollary we can include the strong law of large numbers as a
consequence of the ergodic theorem, except for the converse.


Problems 


12. Show that the event $\{X_n \in B \text{ i.o.}\}$, $B \in \mathcal{B}_1$, is invariant.

13. If $X_1, X_2, \ldots$ is stationary and if $\exists B \in \mathcal{B}_\infty$ such that

\[ A = \{(X_1, X_2, \ldots) \in B\} = \{(X_2, X_3, \ldots) \in B\} \quad \text{a.s.,} \]

show that A is a.s. equal to an invariant event.

14. A process $X_1, X_2, \ldots$ is called a normal process if $X_1, \ldots, X_n$ have a
joint normal distribution for every n. Let $EX_i = 0$, $\Gamma_{ij} = EX_i X_j$.

1) Prove that the process is stationary iff $\Gamma_{ij}$ depends only on $|i - j|$.

2) Assume $\Gamma_{ij} = r(|i - j|)$. Then prove that $\lim_m r(m) = 0$ implies that the
process is ergodic.

[Assume that for every n, the determinant of $\Gamma_{ij}$, $i, j = 1, \ldots, n$, is not
zero. See Chapter 11 for the definition and properties of joint normal
distributions.]


15. Show that $X_1, X_2, \ldots$ is ergodic iff for every $A \in \mathcal{B}_k$, $k = 1, 2, \ldots$,

\[ \frac{1}{n} \sum_{j=1}^{n} \chi_A(X_j, \ldots, X_{j+k-1}) \to P((X_1, \ldots, X_k) \in A) \quad \text{a.s.} \]


16. Let $X_1, X_2, \ldots$ and $Y_1, Y_2, \ldots$ be two stationary, ergodic processes on
$(\Omega, \mathcal{F}, P)$. Toss a coin with probability p of heads independently of X and
Y. If it comes up heads, observe X, if tails, observe Y. Show that the resultant
process is stationary, but not ergodic.
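
The failure of ergodicity in a mixture of this kind is easy to see numerically. The sketch below is a simulation under assumed, hypothetical choices, not a solution of the problem: X and Y are taken to be independent Bernoulli sequences with means 0.2 and 0.8, and the coin is fair. Each realization's time average settles near 0.2 or near 0.8, never near the overall mean 0.5, so the limit of the averages is not a.s. constant.

```python
import random

def mixture_time_average(n, rng, p_heads=0.5, mean_x=0.2, mean_y=0.8):
    """One realization: toss the coin once, then average n terms of the chosen
    i.i.d. Bernoulli process (mean mean_x for X, mean mean_y for Y)."""
    mean = mean_x if rng.random() < p_heads else mean_y
    hits = sum(1 for _ in range(n) if rng.random() < mean)
    return hits / n

rng = random.Random(1)
averages = [mixture_time_average(20_000, rng) for _ in range(20)]
ensemble_mean = 0.5 * 0.2 + 0.5 * 0.8   # expectation of the first coordinate
```

In the language of Theorem 6.28, the limit here is $E(X_1 \mid \mathcal{I})$, which depends on the outcome of the coin toss.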


8. AN APPLICATION 


There is a very elegant application of the above ideas due to Spitzer, Kesten, 
and Whitman (see Spitzer, [130, pp. 35 ff]). Let $X_1, X_2, \ldots$ be a sequence of
independent identically distributed random variables taking values in the
integers. The range $R_n$ of the first n sums is defined as the number of distinct
points in the set $\{S_1, \ldots, S_n\}$. Heuristically, the more the tendency of the
sums $S_n$ to return to the origin, the smaller $R_n$ will be, because if we are at a
given point k at time n, the distribution of points around k henceforth looks
like the distribution of points around the origin starting from n = 1. To
pin this down, write

\[ P(\text{no return}) = P(S_n \ne 0,\ n = 1, 2, \ldots), \]

then,


Proposition 6.34

\[ \lim_n \frac{ER_n}{n} = P(\text{no return}). \]


Proof. Write $R_n = \sum_{k=1}^{n} W_k$, where

\[ W_k = \begin{cases} 1, & \text{if } S_k \ne S_j,\ j = 1, \ldots, k-1, \\ 0, & \text{otherwise.} \end{cases} \]

So now,

\[ EW_k = P(S_k - S_{k-1} \ne 0,\ S_k - S_{k-2} \ne 0,\ \ldots,\ S_k - S_1 \ne 0) \]

\[ = P(X_k \ne 0,\ X_k + X_{k-1} \ne 0,\ \ldots,\ X_k + \cdots + X_2 \ne 0) \]

\[ = P(S_1 \ne 0, \ldots, S_{k-1} \ne 0), \]

the last equality holding because $(X_k, \ldots, X_2)$ has the same distribution as
$(X_1, \ldots, X_{k-1})$. Therefore $\lim_k EW_k = P(\text{no return})$, and the averages $ER_n/n$ of the $EW_k$ have the same limit.
The remarkable result is
Theorem 6.35

\[ \lim_n \frac{R_n}{n} = P(\text{no return}) \quad \text{a.s.} \]


Proof. Take N any positive integer, and let $Z_k$ be the number of distinct
points visited by the successive sums during the time $(k - 1)N + 1$ to
$kN$, that is, $Z_k$ is the range of $\{S_{(k-1)N+1}, \ldots, S_{kN}\}$. Note that $Z_k$ depends
only on the $X_n$ for n between $(k - 1)N + 1$ and $kN$, so that the $Z_k$ are
independent, $|Z_k| \le N$, and are easily seen to be identically distributed. Use
the obvious inequality $R_{nN} \le Z_1 + \cdots + Z_n$ and apply the law of large
numbers:

\[ \limsup_n \frac{R_{nN}}{nN} \le \lim_n \frac{1}{nN}(Z_1 + \cdots + Z_n) = \frac{1}{N} EZ_1. \]

For $n'$ not a multiple of N, $R_{n'}$ differs by at most N from one of the $R_{nN}$, so

\[ \limsup_n \frac{R_n}{n} \le \frac{1}{N} EZ_1 \quad \text{a.s.} \]

But $Z_1 = R_N$, hence letting $N \to \infty$, and using 6.34, we get

\[ \limsup_n \frac{R_n}{n} \le P(\text{no return}) \quad \text{a.s.} \]
Going the other way is more interesting. Define

\[ V_k = \begin{cases} 1, & \text{if } S_j \ne S_k,\ j > k, \\ 0, & \text{otherwise.} \end{cases} \]

That is, $V_k$ is one if at time k, $S_k$ is in a state which is never visited again, and
zero otherwise. Now $V_1 + \cdots + V_n$ is the number of states visited in time
n which are never revisited. $R_n$ is the number of states visited in time n
which are not revisited prior to time n + 1. Thus $R_n \ge V_1 + \cdots + V_n$.
Now define

\[ \varphi(x_1, x_2, \ldots) = \begin{cases} 1, & \text{if } x_1 + \cdots + x_k \ne 0,\ k = 1, 2, \ldots, \\ 0, & \text{otherwise,} \end{cases} \]

and make the important observation that

\[ V_k = \begin{cases} 1, & \text{if } X_{k+1} \ne 0,\ X_{k+1} + X_{k+2} \ne 0,\ \ldots, \\ 0, & \text{otherwise,} \end{cases} \]

\[ = \varphi(X_{k+1}, X_{k+2}, \ldots). \]

Use 6.31 and 6.33 to show that $V_1, V_2, \ldots$ is a stationary, ergodic sequence.
Now the ergodic theorem can be used to conclude that

\[ \frac{V_1 + \cdots + V_n}{n} \to EV_1 \quad \text{a.s.} \]

Of course, $EV_1 = P(\text{no return})$, and this completes the proof.
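
Theorem 6.35 can be watched at work in a simulation. The sketch below assumes a simple random walk with $P(X = +1) = p$ and $P(X = -1) = q = 1 - p$, for which $P(\text{no return}) = |p - q|$ is a standard fact; the value p = 0.7 and the name `range_fraction` are hypothetical choices made for the illustration.

```python
import random

def range_fraction(n, p_up, rng):
    """R_n / n for a simple random walk with P(X = +1) = p_up, P(X = -1) = 1 - p_up."""
    s = 0
    visited = set()
    for _ in range(n):
        s += 1 if rng.random() < p_up else -1
        visited.add(s)          # collects the distinct points among S_1, ..., S_n
    return len(visited) / n

rng = random.Random(2)
p_up = 0.7
ratio = range_fraction(200_000, p_up, rng)
no_return = abs(2 * p_up - 1)   # |p - q| for this walk
```

For a symmetric walk (p = 1/2) the same function would return a ratio drifting toward zero as n grows, in line with P(no return) = 0.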


9. RECURRENCE TIMES 


For $X_0, X_1, \ldots$ stationary, and any set $A \in \mathcal{B}_1$, look at the times that the
process enters A, that is, the n such that $X_n \in A$.


Definition 6.36. For $A \in \mathcal{B}_1$, $P(X_0 \in A) > 0$, define

\[ R_1 = \min\{n;\ X_n \in A,\ n > 0\}, \]

\[ R_2 = \min\{n;\ X_n \in A,\ n > R_1\}, \]

and so forth. These are the occurrence times of the set A. The recurrence times
$T_1, T_2, \ldots$ are given by

\[ (6.37) \qquad T_1 = R_1, \quad T_2 = R_2 - R_1, \quad T_3 = R_3 - R_2, \ \ldots \]


If $\{X_n\}$ is ergodic, then $P(X_n \in A \text{ i.o.}) = 1$; so the $R_k$ are well defined a.s.
But if $\{X_n\}$ is not ergodic a subsidiary condition has to be imposed to make
the $R_k$ well defined. At any rate, the smaller A is, the longer it takes to get
back to it, and in fact we arrive at the following proposition:

Proposition 6.38. Let $X_0, X_1, \ldots$ be a stationary process, $A \in \mathcal{B}_1$ such that

\[ (6.39) \qquad P(X_n \in A \text{ at least once}) = 1, \]

then the $R_k$, $k = 1, \ldots$ are finite a.s. On the sample space $\Omega_A = \{\omega;\ X_0 \in A\}$
the $T_1, T_2, \ldots$ form a stationary sequence under the probability $P(\cdot \mid X_0 \in A)$,
and

\[ E(T_1 \mid X_0 \in A) = \frac{1}{P(X_0 \in A)}. \]


Remarks. This means that to get the $T_1, T_2, \ldots$ to be stationary, we have to
start off on the set A at time zero. This seems too complicated because once
we have landed in A then the returns should be stationary, that is, the
$T_2, T_3, \ldots$ should be stationary under P. This is not so, and counterexamples
are not difficult to construct (see Problem 20).

Note that $P(X_0 \in A) > 0$, otherwise condition (6.39) is violated.
Therefore, conditional probability given $\{X_0 \in A\}$ is well defined.


Proof. Extend $\{X_n\}$ to $0, \pm 1, \pm 2, \ldots$ By (6.39), $P(R_1 < \infty) = 1$. From
the stationarity

\[ P(T_k \le n,\ R_{k-1} = j) = P(R_1 \le n,\ C), \]

where $C \in \mathcal{F}(X_0, X_{-1}, \ldots)$ and $P(C) = P(R_{k-1} = j)$. Let $n \to \infty$, so that we get

\[ P(T_k < \infty,\ R_{k-1} = j) = P(R_{k-1} = j). \]

Go down the ladder to conclude that $P(R_1 < \infty) = 1$ implies $P(R_k < \infty) = 1$, $k \ge 1$.
To prove stationarity, we need to establish that

\[ P(T_1 = m_1, \ldots, T_k = m_k \mid X_0 \in A) = P(T_2 = m_1, \ldots, T_{k+1} = m_k \mid X_0 \in A). \]

This is not difficult to do, but to keep out of notational messes, I prove only
that

\[ P(T_1 = n \mid X_0 \in A) = P(T_2 = n \mid X_0 \in A). \]

The generalization is exactly the same argument. Define random variables
$U_n = \chi_A(X_n)$, and sets $C_k$ by

\[ C_k = \{U_{-1} = 0, \ldots, U_{-k+1} = 0,\ U_{-k} = 1\}. \]

The $\{C_k\}$ are disjoint, and

\[ \bigcup_{k=1}^{\infty} C_k = \{X_n \in A \text{ at least once},\ n \le -1\}. \]

I assert that $P(\bigcup_{1}^{\infty} C_k) = 1$, because

\[ P\Big(\bigcup_{k=1}^{\infty} C_k\Big) = \lim_n P\Big(\bigcup_{k=-n-1}^{-1} \{U_k = 1\}\Big). \]

By stationarity of the $\{U_n\}$ process,

\[ P\Big(\bigcup_{k=-n-1}^{-1} \{U_k = 1\}\Big) = P\Big(\bigcup_{k=0}^{n} \{U_k = 1\}\Big). \]

The limit of the right-hand side is $P(X_n \in A \text{ at least once})$, which is one by
(6.39). Now, using stationarity again, we find that

\[ P(T_2 = n,\ X_0 \in A) = \sum_{k=1}^{\infty} P(T_2 = n,\ T_1 = k,\ X_0 \in A) \]

\[ = \sum_{k=1}^{\infty} P(T_1 = n,\ X_0 \in A,\ C_k) = P(T_1 = n,\ X_0 \in A). \]

To compute $E(T_1 \mid X_0 \in A)$, note that

\[ P(T_1 \ge k,\ X_0 \in A) = P(C_k). \]

Consequently,

\[ E(T_1 \mid X_0 \in A) = \sum_{k=1}^{\infty} P(T_1 \ge k \mid X_0 \in A) = \frac{1}{P(X_0 \in A)} \sum_{k=1}^{\infty} P(C_k) = \frac{1}{P(X_0 \in A)}. \]
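
The formula $E(T_1 \mid X_0 \in A) = 1/P(X_0 \in A)$ can be checked by simulation. The sketch below assumes a hypothetical two-state Markov chain on {0, 1} with $P(0 \to 1) = a$ and $P(1 \to 0) = b$, started from its stationary law so that the sequence is stationary, and takes A = {0}; the stationary probability of state 0 is then b/(a + b). The average gap between successive visits to 0 is compared with the reciprocal of that probability.

```python
import random

def mean_recurrence_time(n, a, b, rng):
    """Average gap between successive visits to state 0 for the two-state chain
    with P(0 -> 1) = a and P(1 -> 0) = b, started from its stationary law."""
    pi0 = b / (a + b)                       # stationary probability of state 0
    state = 0 if rng.random() < pi0 else 1
    visits = []
    for t in range(n):
        if state == 0:
            visits.append(t)                # an occurrence time of A = {0}
            state = 1 if rng.random() < a else 0
        else:
            state = 0 if rng.random() < b else 1
    gaps = [t2 - t1 for t1, t2 in zip(visits, visits[1:])]
    return sum(gaps) / len(gaps), pi0

rng = random.Random(3)
mean_gap, pi0 = mean_recurrence_time(200_000, 0.3, 0.1, rng)
```

With a = 0.3 and b = 0.1 the stationary probability of A is 1/4, so the average gap should come out close to 4.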
We can use the ergodic theorem to get a stronger result. 


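Theorem 6.40. If the process $\{X_n\}$ is ergodic, then the process $\{T_k\}$ on $\{X_0 \in A\}$
is ergodic under $P(\cdot \mid X_0 \in A)$.


Proof. By the ergodic theorem,

\[ \frac{1}{n} \sum_{k=1}^{n} \chi_A(X_k) \to P(X_0 \in A) \quad \text{a.s.} \]

On every sequence such that the latter holds, $R_n \to \infty$, and

\[ \frac{1}{R_n} \sum_{k=1}^{R_n} \chi_A(X_k) \to P(X_0 \in A). \]

The sum $\sum_{1}^{R_n} \chi_A(X_k)$ is the number of visits $X_k$ makes to A up to the time of
the nth occurrence. Thus $\sum_{1}^{R_n} \chi_A(X_k) = n$, so

\[ \frac{T_1 + \cdots + T_n}{n} \to \frac{1}{P(X_0 \in A)} \quad \text{a.s.} \]

Note that for every function f measurable $\mathcal{B}_\infty$, there is a function g
measurable $\mathcal{B}_\infty(\{0, 1\})$ such that on the set $\{R_{k-1} = j\}$,

\[ f(T_k, T_{k+1}, \ldots) = g(U_{j+1}, U_{j+2}, \ldots), \]

where $U_n$ denotes $\chi_A(X_n)$ again. Therefore

\[ \sum_{k=1}^{n} f(T_k, \ldots) = \sum_{j=0}^{R_{n-1}} U_j \, g(U_{j+1}, \ldots). \]

Since $R_{n-1} \to \infty$ a.s., if $E|g| < \infty$ we can use the ergodic theorem as
follows:

\[ (6.41) \qquad \frac{1}{n} \sum_{k=1}^{n} f(T_k, \ldots) = \frac{R_{n-1}}{n} \cdot \frac{1}{R_{n-1}} \sum_{j=0}^{R_{n-1}} U_j \, g(U_{j+1}, \ldots) \to \frac{1}{P(X_0 \in A)} E(U_0 \, g(U_1, \ldots)) \quad \text{a.s.} \]

Because $U_0 = \chi_A(X_0)$, this limit is

\[ E(g(U_1, \ldots) \mid X_0 \in A) \quad \text{or} \quad E(f(T_1, \ldots) \mid X_0 \in A). \]

On the space $\{X_0 \in A\}$, take f to be any bounded invariant function of
$T_1, \ldots$, that is,

\[ f(T_1, T_2, \ldots) = f(T_k, T_{k+1}, \ldots), \quad \text{all } k \ge 1. \]

Then (6.41) implies that $f(T_1, \ldots)$ is a.s. constant on $\{X_0 \in A\}$, implying
in turn that the process $T_1, T_2, \ldots$ is ergodic.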


10. STATIONARY POINT PROCESSES 


Consider a class of processes gotten as follows: to every integer n, positive
and negative, associate a random variable $U_n$ which is either zero or one. A
way to look at this process is as a sequence of points. If a point occurs at
time n, then $U_n = 1$, otherwise $U_n = 0$. We impose a condition to ensure
that these processes have a.s. no trivial sample points.


Condition 6.42. There is probability zero that all $U_n = 0$.


For a point process to be time-homogeneous, the $\{U_n\}$ have to form a
stationary sequence.


Definition 6.43. A stationary discrete point process is a stationary process $\{U_n\}$,
$n = 0, \pm 1, \ldots$ where $U_n$ is either zero or one.


Note: $P(U_0 = 1) > 0$, otherwise $P(\text{all } U_n = 0) = 1$. Take A = {1}, then
(6.38) implies that given $\{U_0 = 1\}$ the times between points form a stationary
sequence $T_1, T_2, \ldots$ of positive integer-valued random variables such that

\[ ET_1 = \frac{1}{P(U_0 = 1)} \quad \text{or} \quad P(U_0 = 1) = \frac{1}{ET_1}. \]

The converse question is: Given a stationary sequence $T_1, T_2, \ldots$ of
positive integer valued random variables, is there a stationary point process
$\{U_n\}$ with interpoint distances having the same distribution as $T_1, T_2, \ldots$?
The difficulty in extracting the interpoint distances from the point process was
that an origin had to be pinned down; that is, $\{U_0 = 1\}$ had to be given.
But here the problem is to start with the $T_1, T_2, \ldots$ and smear out the origin
somehow. Define

\[ V_k = \begin{cases} 1, & \exists n \text{ such that } T_1 + \cdots + T_n = k, \\ 0, & \text{otherwise.} \end{cases} \]

Suppose that there is a stationary process $\{U_n\}$ with interpoint distances
having the same distribution as $T_1, \ldots$ Denote the probability on $\{U_n\}$ by
Q. Then for s any sequence k-long of zeroes and ones

\[ Q((U_1, \ldots, U_k) = s \mid U_0 = 1) = P((V_1, \ldots, V_k) = s). \]

This leads to

\[ (6.44) \qquad Q((U_1, \ldots, U_k) = s) = \sum_{j=0}^{\infty} Q((U_1, \ldots, U_k) = s,\ U_0 = 0, \ldots, U_{-j+1} = 0,\ U_{-j} = 1) \]

\[ = \sum_{j=0}^{\infty} Q((U_{j+1}, \ldots, U_{j+k}) = s,\ U_j = 0, \ldots, U_1 = 0,\ U_0 = 1) \]

\[ = Q(U_0 = 1) \sum_{j=0}^{\infty} P((V_{j+1}, \ldots, V_{j+k}) = s,\ T_1 > j) \]

\[ = \frac{1}{ET_1} \sum_{j=0}^{\infty} P((V_{j+1}, \ldots, V_{j+k}) = s,\ T_1 > j). \]

The right-hand side above depends only upon the probability P on the
$T_1, T_2, \ldots$ process. So if a stationary point process exists with interpoint
distances $T_1, T_2, \ldots$ it must be unique. Furthermore, if we define Q directly
by means of (6.44), then it is not difficult to show that we get a consistent set
of probabilities for a stationary point process having interpoint distances with
distribution $T_1, T_2, \ldots$ Thus


Theorem 6.45. Let $T_1, T_2, \ldots$ be a stationary sequence of positive integer
valued random variables such that $ET_1 < \infty$. Then there is a unique stationary
point process $\{U_n\}$ such that the interpoint distances given $\{U_0 = 1\}$ have the
same distribution as $T_1, T_2, \ldots$
Proof. The work is in showing that the Q defined in (6.44) is a probability for
the desired process. To get stationarity one needs to verify that

\[ Q((U_1, \ldots, U_k) = s) = Q((U_2, \ldots, U_{k+1}) = s). \]

Once this is done, then it is necessary to check that the interpoint distances
given $\{U_0 = 1\}$ have the same distribution as $T_1, T_2, \ldots$ For example,
check that

\[ Q(U_j = 1,\ U_{j-1} = 0, \ldots, U_1 = 0 \mid U_0 = 1) = P(T_1 = j). \]

All this verification we leave to the reader.
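
To see (6.44) in a concrete case, take s = (0, ..., 0, 1) of length n. The event $\{(V_{j+1}, \ldots, V_{j+n}) = s,\ T_1 > j\}$ is then $\{T_1 = j + n\}$, so (6.44) reduces to $Q(U_1 = 0, \ldots, U_{n-1} = 0,\ U_n = 1) = P(T_1 \ge n)/ET_1$, the law of the first point past the origin. The sketch below evaluates this law exactly for an assumed, hypothetical distribution of $T_1$ on {1, 2, 3}; the function name `first_point_law` is likewise only illustrative.

```python
from fractions import Fraction

def first_point_law(pmf):
    """Law of the first point past the origin, P(T_1 >= n) / E T_1, for an
    interpoint distance T_1 with the finite pmf {value: probability}."""
    mean = sum(t * p for t, p in pmf.items())
    law = {}
    for n in range(1, max(pmf) + 1):
        tail = sum(p for t, p in pmf.items() if t >= n)   # P(T_1 >= n)
        law[n] = tail / mean
    return law

# Hypothetical interpoint distribution used only for illustration.
pmf = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
law = first_point_law(pmf)
```

Here $ET_1 = 7/4$, and the resulting probabilities 4/7, 2/7, 1/7 add up to one, as they must if Q is a probability.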


Problems 


17. Given the process $T_1, T_2, \ldots$ such that $T_k = 2$, all k, describe the
corresponding stationary point process.

18. If the interpoint distances for a stationary point process, given $\{U_0 = 1\}$,
are $T_1, T_2, \ldots$, prove that the distribution of the time $n^*$ until the first point
past the origin, that is,

\[ n^* = \min\{n;\ n \ge 1,\ U_n = 1\} \]

is given by

\[ P(n^* = n) = \frac{1}{ET_1} P(T_1 \ge n). \]


19. If the interpoint distances $T_1, T_2, \ldots$ given $\{U_0 = 1\}$ are independent
random variables, then show that the unconditioned interpoint distances
$T_2, T_3, \ldots$ are independent identically distributed random variables with the
same distribution as the conditioned random variables and that $(T_2, T_3, \ldots)$ is
independent of the time $n^*$ of the first point past the origin.

20. Consider the stationary point process $\{U_n\}$ having independent interpoint
distances $T_1, T_2, \ldots$ with $P(T_1 = 1) = 1 - \epsilon$, with $\epsilon$ very small. Now
consider the stationary point process

\[ V_n = 1 - U_n, \quad n = 0, \pm 1, \ldots \]

Show that for this latter process the random variables defined by the distance
from the first point past the origin to the second point, from the second
point to the third point, etc., do not form a stationary sequence.


NOTES 

The ergodic theorem was proven by G. D. Birkhoff in 1931 [4]. Since 
that time significant improvements have been made on the original lengthy 
proof, and the proof given here is the most recent simplification in a sequence 
involving many simplifications and refinements. 
The nicest texts around on ergodic theorems in a framework of measure-preserving
transformations are Halmos [65], and Hopf [73] (the latter
available only in German). From the point of view of dynamical systems, see
Khintchine [90], and Birkhoff [5]. Recently, E. Hopf generalized the ergodic
theorem to operators T acting on measurable functions defined on a measure
space $(\Omega, \mathcal{F}, \mu)$ such that T does not increase $L_1$-norm or $L_\infty$-norm. For
this and other operator-theoretic aspects, see the Dunford-Schwartz book
[41].

For $X_1, \ldots$ a process on $(\Omega, \mathcal{F}, P)$, we could try to define a set transformation
$S^{-1}$ on $\mathcal{F}(\mathbf{X})$ similar to the shift in sequence space as follows: If
$A \in \mathcal{F}(\mathbf{X})$, then take $B \in \mathcal{B}_\infty$ such that $A = \{(X_1, X_2, \ldots) \in B\}$ and define
$S^{-1}A = \{(X_2, X_3, \ldots) \in B\}$. The same difficulty comes up again; B is not
uniquely determined, and if $A = \{\mathbf{X} \in B_1\}$, $A = \{\mathbf{X} \in B_2\}$, it is not true in
general that

\[ \{(X_2, X_3, \ldots) \in B_1\} = \{(X_2, X_3, \ldots) \in B_2\}. \]

But for stationary processes, it is easy to see that these latter two sets differ
only by a set of probability zero. Therefore $S^{-1}$ can be defined, not on
$\mathcal{F}(\mathbf{X})$, but only on equivalence classes of sets in $\mathcal{F}(\mathbf{X})$, where sets $A_1, A_2 \in \mathcal{F}(\mathbf{X})$
are equivalent if $P(A_1 \,\triangle\, A_2) = 0$. For a deeper discussion of this and other
topics relating to the translation into a probabilistic context of the ergodic
theorem see Doob [39].

The fact that the expected recurrence time of A starting from A is
$1/P(X_0 \in A)$ is due to Kac [81]. A good development of point processes and
proofs of a more general version of 6.45 and the ergodic property of the
recurrence times is in Ryll-Nardzewski [119].


CHAPTER 7 


MARKOV CHAINS 


1. DEFINITIONS 


The basic property characterizing Markov chains is a probabilistic analogue
of a familiar property of dynamical systems. If one has a system of particles
and the position and velocities of all particles are given at time t, the equations
of motion can be completely solved for the future development of the system.
Therefore, any other information given concerning the past of the process up
to time t is superfluous as far as future development is concerned. The present
state of the system contains all relevant information concerning the future.
Probabilistically, we formalize this by defining a Markov chain as


Definition 7.1. A process $X_0, X_1, \ldots$ taking values in $F \in \mathcal{B}_1$ is called a
Markov chain if

\[ P(X_{n+1} \in A \mid X_n, \ldots, X_0) = P(X_{n+1} \in A \mid X_n) \quad \text{a.s.} \]

for all $n \ge 0$ and $A \in \mathcal{B}_1(F)$.


For each n, there is a version $p_n(A \mid x)$ of $P(X_{n+1} \in A \mid X_n = x)$ which
is a regular conditional distribution. These are the transition probabilities
of the process. We restrict ourselves to the study of a class of Markov
chains which have a property similar to conservative dynamical systems.
One way to state that a system is conservative is that if it goes from any
state x at t = 0 to y in time $\tau$, then starting from x at any time t it will be
in y at time $t + \tau$. The corresponding property for Markov chains is that the
transition probabilities do not depend on time.


Definition 7.2. A Markov chain $X_0, X_1, \ldots$ on $F \in \mathcal{B}_1$ has stationary transition
probabilities $p(A \mid x)$ if $p(A \mid x)$ is a regular conditional distribution and if for
each $A \in \mathcal{B}_1(F)$, $n \ge 0$, $p(A \mid x)$ is a version of $P(X_{n+1} \in A \mid X_n = x)$. The
initial distribution is defined by

\[ \pi(A) = P(X_0 \in A). \]


The transition probabilities and the initial distribution determine the
distribution of the process.
Proposition 7.3. For a Markov chain $X_0, X_1, \ldots$ on $F \in \mathcal{B}_1$,

\[ (7.4) \qquad P(X_n \in A_n,\ X_{n-1} \in A_{n-1}, \ldots, X_0 \in A_0) = \int_{A_{n-1}} \cdots \int_{A_0} p(A_n \mid x_{n-1}) \, p(dx_{n-1} \mid x_{n-2}) \cdots p(dx_1 \mid x_0) \, \pi(dx_0) \]

for all $A_n, \ldots, A_0 \in \mathcal{B}_1(F)$.

Proof. Use

\[ P(X_n \in A_n,\ X_{n-1} \in A_{n-1}, \ldots, X_0 \in A_0) = \int_{\{X_{n-1} \in A_{n-1}, \ldots, X_0 \in A_0\}} P(X_n \in A_n \mid X_{n-1}, \ldots, X_0) \, dP \]

and proceed by induction.
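
When F is a finite set, the integrals in (7.4) become sums and the probability of a single path is the initial probability times a product of one-step transition probabilities. The sketch below illustrates this for a hypothetical two-state chain; the numbers in `init` and `trans` and the function name `path_probability` are assumptions made for the example, not taken from the text.

```python
from fractions import Fraction
from itertools import product

def path_probability(path, init, trans):
    """P(X_0 = path[0], ..., X_n = path[n]): the initial probability times the
    product of one-step transitions, the finite-state form of (7.4)."""
    prob = init[path[0]]
    for x, y in zip(path, path[1:]):
        prob *= trans[x][y]       # p({y} | x)
    return prob

# Hypothetical two-state chain used only for illustration.
init = [Fraction(1, 3), Fraction(2, 3)]
trans = [[Fraction(1, 2), Fraction(1, 2)],
         [Fraction(1, 4), Fraction(3, 4)]]

p_010 = path_probability((0, 1, 0), init, trans)
total = sum(path_probability(path, init, trans)
            for path in product(range(2), repeat=4))
```

Summing the path probabilities over all paths of a fixed length gives one, which is the consistency needed to use (7.4) as the definition of a probability on rectangles.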


The Markov property as defined in 7.1 simply states that the present 
state of the system determines the probability for one step into the future. 
This generalizes easily: 


Proposition 7.5. Let $X_0, X_1, \ldots$ be a Markov chain, and C any event in
$\mathcal{F}(X_{n+1}, X_{n+2}, \ldots)$. Then

\[ P(C \mid X_n, \ldots, X_0) = P(C \mid X_n) \quad \text{a.s.} \]

Having stationary transition probabilities generalizes into


Proposition 7.6. If the process $X_0, X_1, \ldots$ is Markov on $F \in \mathcal{B}_1$ with stationary
transition probabilities, then for every $B \in \mathcal{B}_\infty(F)$ there are versions of

\[ P((X_{n+1}, X_{n+2}, \ldots) \in B \mid X_n = x) \]

and

\[ P((X_1, X_2, \ldots) \in B \mid X_0 = x), \]

which are equal.


The proofs of both 7.5 and 7.6 are left to the reader. 
Proposition 7.3 indicates how to do the following construction: 


Proposition 7.7. Let $p(A \mid x)$ be a regular conditional probability on $\mathcal{B}_1(F)$ for
$x \in F$, and $\pi(A)$ a probability on $\mathcal{B}_1(F)$. Then there exists a Markov chain
$\tilde{X}_0, \tilde{X}_1, \tilde{X}_2, \ldots$ with stationary transition probabilities $p(A \mid x)$ and initial
distribution $\pi$.


Proof. Use (7.4) to define probabilities on rectangles. Then it is easy to
check that all the conditions of 2.18 are met. Now extend and use the
coordinate representation process. What remains is to show that any
process satisfying (7.4) is a Markov chain with $p(A \mid x)$ as its transition
probabilities. (That $P(X_0 \in A) = \pi(A)$ is obvious.) For this, use (7.4) for
n - 1 to conclude that (7.4) can be written as

\[ P(X_n \in A_n, \ldots) = \int_{A_{n-1}} p(A_n \mid x_{n-1}) \, P(X_{n-1} \in dx_{n-1},\ X_{n-2} \in A_{n-2}, \ldots) \]

\[ = \int_{\{x_{n-1} \in A_{n-1},\ x_{n-2} \in A_{n-2}, \ldots\}} p(A_n \mid x_{n-1}) \, P(X_{n-1} \in dx_{n-1},\ X_{n-2} \in dx_{n-2}, \ldots). \]

But, by definition,

\[ P(X_n \in A_n, \ldots) = \int_{\{X_{n-1} \in A_{n-1}, \ldots\}} P(X_n \in A_n \mid X_{n-1}, \ldots) \, dP. \]

Hence $p(A \mid x_{n-1})$ is a version of $P(X_n \in A \mid X_{n-1} = x_{n-1}, \ldots)$ for all n.

Now we switch points of view. For the rest of this chapter forget about
the original probability space $(\Omega, \mathcal{F}, P)$. Fix one version $p(A \mid x)$ of the
stationary transition probabilities and consider $p(A \mid x)$ to be the given data.
Each initial distribution $\pi$, together with $p(A \mid x)$, determines a probability
on $(F^{(\infty)}, \mathcal{B}_\infty(F))$ we denote by $P_\pi$, and the corresponding expectation by
$E_\pi$. Under $P_\pi$, the coordinate representation process $\tilde{X}_0, \tilde{X}_1, \ldots$ becomes a
Markov chain with initial distribution $\pi$. Thus we are concerned with a
family of processes having the same transition probability, but different
initial distributions. If $\pi$ is concentrated on the single point {x}, then denote
the corresponding probability on $\mathcal{B}_\infty(F)$ by $P_x$, and the expectation by $E_x$.
Under $P_x$, $\tilde{X}_0, \tilde{X}_1, \ldots$ is referred to as the "process starting from the point
x." Now eliminate the ~ over variables, with the understanding that henceforth
all processes referred to are coordinate representation processes with
P one of the family of probabilities $\{P_\pi\}$. For any $\pi$, and $B \in \mathcal{B}_\infty(F)$, always
use the version of $P_\pi((X_n, X_{n+1}, \ldots) \in B \mid X_n = x)$ given by $P_x((X_0, X_1, \ldots) \in B)$.

In exactly the same way as in Chapter 3, call a nonnegative integer-valued
random variable $n^*$ a stopping time or Markov time for a Markov
chain $X_0, X_1, \ldots$ if

\[ \{n^* = n\} \in \mathcal{F}(X_0, \ldots, X_n). \]

Define $\mathcal{F}(X_n, n \le n^*)$ to be the $\sigma$-field consisting of all events A such that
$A \cap \{n^* \le n\} \in \mathcal{F}(X_0, \ldots, X_n)$. Then


Proposition 7.8. If $n^*$ is a Markov time for $X_0, X_1, \ldots$, then

\[ P((X_{n^*}, X_{n^*+1}, \ldots) \in B \mid X_n,\ n \le n^*) = P_{X_{n^*}}((X_0, X_1, \ldots) \in B) \quad \text{a.s. } P_\pi. \]

Note. The correct interpretation of the above is: Let

\[ \varphi(x) = P_x((X_0, X_1, \ldots) \in B). \]

Then the right-hand side is $\varphi(X_{n^*})$.
Proposition 7.8 is called the strong Markov property.
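Proof. Let $A \in \mathcal{F}(X_n, n \le n^*)$, then the integral of the left-hand side of
7.8 over A is

\[ P((X_{n^*}, X_{n^*+1}, \ldots) \in B,\ A) = \sum_{n=0}^{\infty} P((X_{n^*}, X_{n^*+1}, \ldots) \in B,\ A,\ n^* = n) \]

\[ = \sum_{n=0}^{\infty} P((X_n, X_{n+1}, \ldots) \in B,\ A,\ n^* = n). \]

The set $A \cap \{n^* = n\} \in \mathcal{F}(X_0, \ldots, X_n)$. Hence

\[ P((X_n, \ldots) \in B,\ A,\ n^* = n) = \int_{A \cap \{n^* = n\}} P((X_n, \ldots) \in B \mid X_n, \ldots, X_0) \, dP = \int_{A \cap \{n^* = n\}} \varphi(X_n) \, dP. \]

Putting this back in, we find that

\[ P((X_{n^*}, \ldots) \in B,\ A) = \sum_{n=0}^{\infty} \int_{A \cap \{n^* = n\}} \varphi(X_n) \, dP = \int_A \varphi(X_{n^*}) \, dP. \]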


A special case of a Markov chain is the sequence of successive sums
$S_0, S_1, S_2, \ldots$ of independent random variables $Y_1, Y_2, \ldots$
(with the convention $S_0 = 0$). This is true because independence gives

$$P(S_{n+1} \in A \mid S_n, \ldots, S_0) = P(S_{n+1} \in A \mid S_n) \quad \text{a.s.}$$

If the summands are identically distributed, then the chain has stationary
transition probabilities:

$$P(S_{n+1} \in A \mid S_n = x) = P(Y_{n+1} \in A - x) = F(A - x) \quad \text{a.s.},$$

where $F(B)$ denotes the probability $P(Y_1 \in B)$ on $\mathcal{B}_1$. In
this case, take $F(A - x)$ as the fixed conditional probability distribution
to be used. Now letting $X_0$ have any initial distribution $\pi$ and using
the transition probability $F(A - x)$ we get a Markov chain $X_0, X_1, \ldots$
having the same distribution as $Y_0, Y_0 + Y_1, Y_0 + Y_1 + Y_2, \ldots$
where $Y_0, Y_1, \ldots$ are independent, $Y_0$ has the distribution $\pi$,
and $Y_1, Y_2, \ldots$ all have the distribution $F$. In particular, the
process “starting from $x$” has the distribution of
$S_0 + x, S_1 + x, S_2 + x, \ldots$ Call any such Markov process a random
walk.


Problems 


1. Define the n-step transition probabilities $p^{(n)}(A \mid x)$ for all
$A \in \mathcal{B}_1(F)$, $x \in F$ by

$$p^{(1)}(A \mid x) = p(A \mid x), \qquad p^{(n+1)}(A \mid x) = \int p^{(n)}(A \mid y)\, p(dy \mid x).$$

a) Show that $p^{(n)}(A \mid x)$ equals $P_x(X_n \in A)$, hence is a version
of $P(X_n \in A \mid X_0 = x)$.

b) Show that $p^{(n)}(A \mid x)$ is a regular conditional probability on
$\mathcal{B}_1(F)$ given $x \in F$, and that for all
$A \in \mathcal{B}_1(F)$, $x \in F$, $n, m > 0$,

$$p^{(n+m)}(A \mid x) = \int p^{(n)}(A \mid y)\, p^{(m)}(dy \mid x).$$

2. Let $g(x)$ be a bounded $\mathcal{B}_\infty$-measurable function.
Demonstrate that $E_x g(X_1, X_2, \ldots)$ is $\mathcal{B}_1$-measurable
in $x$.


2. ASYMPTOTIC STATIONARITY 


There is a class of limit theorems which state that certain processes are
asymptotically stationary. Generally these theorems are formulated as:
Given a process $X_1, X_2, \ldots$, a stationary process
$X_1^*, X_2^*, \ldots$ exists such that for every
$B \in \mathcal{B}_\infty$,

$$\lim_n P((X_n, X_{n+1}, \ldots) \in B) = P^*((X_1^*, X_2^*, \ldots) \in B).$$

The best known of these relate to Markov chains with stationary
transition probabilities $p(A \mid x)$. Actually, what we would really like
to show for Markov chains is that no matter what the initial distribution
of $X_0$ is, convergence toward the same stationary limiting distribution
sets in, that is, that for all $B \in \mathcal{B}_\infty(F)$, and initial
distributions $\pi$,

$$\lim_n P_\pi((X_n, \ldots) \in B) = P^*((X_1^*, \ldots) \in B),$$

where $\{X_n^*\}$ is stationary.
Suppose, to begin, that there is a limiting distribution $\pi(\cdot)$ on
$\mathcal{B}_1(F)$ such that for all $x \in F$, $A \in \mathcal{B}_1(F)$,

$$\lim_n P_x(X_n \in A) = \pi(A).$$

If this limiting distribution $\pi(A)$ exists, then, for any initial
distribution $\nu$, from

$$P_\nu(X_n \in A) = \int P_\nu(X_n \in A \mid X_0 = x)\, \nu(dx) = \int P_x(X_n \in A)\, \nu(dx)$$

comes

$$P_\nu(X_n \in A) \to \pi(A), \quad \text{all } \nu.$$

Also,

$$P_\nu(X_{n+1} \in A) = \int P_\nu(X_{n+1} \in A \mid X_n = x)\, P_\nu(X_n \in dx).$$

For $A$ fixed, approximate $p(A \mid x)$ uniformly by simple functions of
$x$. Taking limits implies that $\pi(A)$ must satisfy

(7.9)    $$\pi(A) = \int p(A \mid x)\, \pi(dx).$$


If $\pi$ is the limiting distribution of $X_n$, what happens if we start the
process off with the distribution $\pi$? The idea is that if $\pi$ is a
limiting steady-state distribution, then, starting the system with this
distribution, it should maintain a stable behavior. This is certainly true
in the following sense: let us start the process with any initial
distribution $\pi$ satisfying (7.9). Then this distribution maintains itself
throughout time, that is,

$$P_\pi(X_n \in A) = P_\pi(X_{n-1} \in A) = \cdots = P_\pi(X_0 \in A).$$

This is established by iterating (7.9) to get
$\pi(A) = \int p^{(n)}(A \mid x)\, \pi(dx)$. In this sense any solution of
(7.9) gives stable initial conditions to the process.


Definition 7.10. For transition probabilities $p(A \mid x)$, an initial
distribution $\pi(A)$ will be called a stationary initial distribution if it
satisfies (7.9).
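
As a minimal sketch of Definition 7.10 for a finite state space, assume a
hypothetical three-state chain stored row-wise, so that `P[x][y]` stands for
$p(\{y\} \mid x)$. The matrix and the names `step` and `pi` are illustrative
assumptions, not part of the text. The code checks (7.9) exactly in rational
arithmetic and confirms that the distribution maintains itself over several
steps.

```python
from fractions import Fraction as Fr

# Hypothetical 3-state transition matrix; P[x][y] plays the role of p({y} | x).
P = [
    [Fr(1, 2), Fr(1, 2), Fr(0)],
    [Fr(1, 4), Fr(1, 2), Fr(1, 4)],
    [Fr(0), Fr(1, 2), Fr(1, 2)],
]

def step(dist, P):
    """Distribution of X_{n+1} given that X_n has distribution dist."""
    n = len(P)
    return [sum(dist[x] * P[x][y] for x in range(n)) for y in range(n)]

# Candidate stationary initial distribution for this particular matrix.
pi = [Fr(1, 4), Fr(1, 2), Fr(1, 4)]

# (7.9) on a finite space: pi(y) = sum_x p({y} | x) pi(x).
assert step(pi, P) == pi

# Iterating (7.9): the distribution of X_n is pi for every n.
dist = pi
for _ in range(10):
    dist = step(dist, P)
assert dist == pi
```

Starting from a point mass instead of `pi` gives a distribution that does
change from step to step, which is the contrast the definition is meant to
capture.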


But if a stationary initial distribution is used for the process, much more is 
true. 


Proposition 7.11. Let $X_0, X_1, \ldots$ be a Markov process with stationary
transition probabilities such that the initial distribution $\pi(A)$
satisfies (7.9); then the process is stationary.

Proof. By 7.6 there are versions of
$P(X_n \in A_n, \ldots, X_1 \in A_1 \mid X_0 = x)$ and
$P(X_{n+1} \in A_n, \ldots, X_2 \in A_1 \mid X_1 = x)$ which are equal. Since
$P(X_1 \in A) = P(X_0 \in A) = \pi(A)$, integrating these versions over
$A_0$ gives

$$P(X_n \in A_n, \ldots, X_0 \in A_0) = P(X_{n+1} \in A_n, \ldots, X_1 \in A_0),$$

which is sufficient to prove the process stationary.


Furthermore, Markov chains have the additional property that if
$P_x(X_n \in A)$ converges to $\pi(A)$, all $A \in \mathcal{B}_1(F)$,
$x \in F$, then this one-dimensional convergence implies that the
distribution of the entire process is converging to the distribution of the
stationary process with initial distribution $\pi$.

Proposition 7.12. If $p^{(n)}(A \mid x) \to \pi(A)$, all
$A \in \mathcal{B}_1(F)$, $x \in F$, then for any
$B \in \mathcal{B}_\infty(F)$, and all $x \in F$,

$$P_x((X_{n+1}, X_{n+2}, \ldots) \in B) \to P_\pi((X_1, X_2, \ldots) \in B).$$

Proof. Write

$$\varphi(x) = P_x((X_1, X_2, \ldots) \in B).$$

Then

$$P_x((X_{n+1}, \ldots) \in B \mid X_n = y) = \varphi(y) \quad \text{a.s.},$$

and, also,

$$P_\pi((X_1, X_2, \ldots) \in B \mid X_0 = x) = \varphi(x) \quad \text{a.s.}$$

Now,

$$P((X_{n+1}, \ldots) \in B \mid X_0 = x) = \int P((X_{n+1}, \ldots) \in B \mid X_n = y)\, P(X_n \in dy \mid X_0 = x),$$

or

$$P((X_{n+1}, \ldots) \in B \mid X_0 = x) = \int \varphi(y)\, p^{(n)}(dy \mid x).$$

Under the stated conditions, one can show, by taking simple functions that
approximate $\varphi(x)$ uniformly, that

$$\int \varphi(y)\, p^{(n)}(dy \mid x) \to \int \varphi(y)\, \pi(dy) = P_\pi((X_1, \ldots) \in B).$$

Therefore, the asymptotic behavior problem becomes: How many stationary
initial distributions does a given set of transition probabilities have, and
does $p^{(n)}(A \mid x)$ converge to some stationary distribution as
$n \to \infty$?


3. CLOSED SETS, INDECOMPOSABILITY, ERGODICITY 

Definition 7.13. A set $A \in \mathcal{B}_1(F)$ is closed if
$p(A \mid x) = 1$ for all $x \in A$.

The reason for this definition is obvious; if $X_0 \in A$, and $A$ is closed,
then $X_n \in A$ with probability one for all $n$. Hence if there are two
closed disjoint sets $A_1, A_2$, then

$$p^{(n)}(A_1 \mid x) = \begin{cases} 1, & x \in A_1,\ n > 0, \\ 0, & x \in A_2,\ n > 0, \end{cases}$$

and there is no hope that $p^{(n)}(A_1 \mid x)$ converges to the same limit
for all starting points $x$.


Definition 7.14. A chain is called indecomposable if there are no two
disjoint closed sets $A_1, A_2 \in \mathcal{B}_1(F)$.

Use a stationary initial distribution $\pi$, if one exists, to get the
stationary process $X_0, X_1, \ldots$ If the process is in addition ergodic,
then use the ergodic theorem to assert

(7.15)    $$\frac{1}{n} \sum_{k=1}^{n} \chi_A(X_k) \to \pi(A) \quad \text{a.s.}$$

Take conditional expectations of (7.15), given $X_0 = x$. Use the
boundedness and Proposition 4.24 to get

$$\frac{1}{n} \sum_{k=1}^{n} p^{(k)}(A \mid x) \to \pi(A) \quad \text{a.s. } \pi(dx).$$

Thus, from ergodicity, it is possible to get convergence of the averages of
the $p^{(n)}(A \mid x)$ to $\pi(A)$ a.s. $\pi(dx)$. The questions of
ergodicity of the process, uniqueness of stationary initial distributions,
and indecomposability go together.
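
The distinction between convergence of $p^{(n)}(A \mid x)$ and convergence
of its averages can be seen in a small sketch. Assume a hypothetical
two-state chain that flips state at every step, stored row-wise with
`P[x][y]` standing for $p(\{y\} \mid x)$; the helper names `mat_mul` and
`cesaro_average` are illustrative. The n-step probabilities oscillate
between 0 and 1, yet the averages settle at the stationary distribution
(1/2, 1/2).

```python
from fractions import Fraction as Fr

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cesaro_average(P, n):
    """Return (1/n) * (P^1 + ... + P^n), with P[x][y] = p({y} | x)."""
    size = len(P)
    power = [row[:] for row in P]                      # current power, P^1
    total = [[Fr(0)] * size for _ in range(size)]
    for _ in range(n):
        total = [[total[i][j] + power[i][j] for j in range(size)]
                 for i in range(size)]
        power = mat_mul(power, P)
    return [[entry / n for entry in row] for row in total]

# Hypothetical deterministic flip chain: 0 -> 1 -> 0 -> ...
P = [[Fr(0), Fr(1)], [Fr(1), Fr(0)]]

half = Fr(1, 2)
assert cesaro_average(P, 1000) == [[half, half], [half, half]]
```

For odd `n` the averages are not exactly 1/2 but differ from it by a term
of order 1/n.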


Theorem 7.16. Let the chain be indecomposable. Then if a stationary initial
distribution $\pi$ exists, it is unique and the process gotten by using
$\pi$ as the initial distribution is ergodic.

Proof. Let $C$ be an invariant event under the shift transformation in
$\mathcal{B}_\infty$, $\pi$ a stationary initial distribution. Take
$\varphi(x) = P_x(C)$. Now, using $\pi$ as the initial distribution, write

$$P(C \mid X_0) = E(P(C \mid X_1, X_0) \mid X_0).$$

By the Markov property, since $C \in \mathcal{F}(X_n, X_{n+1}, \ldots)$ for
all $n \ge 0$,

$$P(C \mid X_1, X_0) = P(C \mid X_1) \quad \text{a.s.}$$

By Proposition 7.6,

$$P(C \mid X_1 = x) = P(S^{-1}C \mid X_0 = x) \quad \text{a.s. } \pi,$$

and by the invariance of $C$, then

$$P(C \mid X_1 = x) = \varphi(x).$$

Therefore $\varphi(x)$ satisfies

(7.17)    $$\varphi(x) = \int \varphi(y)\, p(dy \mid x) \quad \text{a.s. } \pi(dx).$$

By a similar argument, $P(C \mid X_n, \ldots, X_0) = P(C \mid X_n) =
\varphi(X_n)$ a.s. and
$E(\varphi(X_{n+1}) \mid X_n, \ldots, X_0) = \varphi(X_n)$ a.s. Apply the
martingale theorem to get $\varphi(X_n) \to \chi_C$ a.s. Thus for any
$\epsilon > 0$,

$$\pi\{x;\ \epsilon < \varphi(x) < 1 - \epsilon\} = 0$$

because the distribution of $X_n$ is $\pi$. So $\varphi(x)$ can assume only
the two values 0, 1 a.s. $\pi(dx)$. Define sets

$$A_1 = \{x;\ \varphi(x) = 1\}, \qquad A_2 = \{x;\ \varphi(x) = 0\}.$$

Since $\varphi(x)$ is a solution of (7.17),

$$1 = \int \varphi(y)\, p(dy \mid x), \quad x \in A_1,$$
$$0 = \int \varphi(y)\, p(dy \mid x), \quad x \in A_2,$$

except for a set $D$ such that $\pi(D) = 0$. Therefore

$$p(A_i \mid x) = 1, \quad x \in A_i - D, \quad i = 1, 2.$$

Let us define

$$A_i^{(0)} = A_i,$$
$$A_i^{(1)} = A_i - D,$$
$$A_i^{(n+1)} = A_i^{(n)} \cap \{x;\ p(A_i^{(n)} \mid x) = 1\}.$$

If $p(A_i^{(n)} \mid x) = 1$, $x \in A_i^{(n)}$, a.s. $\pi$, then
$\pi(A_i^{(n)} - A_i^{(n+1)}) = 0$. Take
$C_i = \bigcap_{n=1}^{\infty} A_i^{(n)}$. Then $\pi(C_i) = \pi(A_i)$, but
the $C_i$ are closed and disjoint. Hence one of $C_1, C_2$ is empty,
$\varphi(x)$ is zero a.s. or one a.s., $P(C) = 0$ or 1, and the process is
ergodic. Now, suppose there are two stationary initial distributions
$\pi_1$ and $\pi_2$ leading to probabilities $P_1$ and $P_2$ on
$(\Omega, \mathcal{F})$. By 6.24 there is an invariant $C$ such that
$P_1(C) = 1$, but $P_2(C) = 0$. Using the stationary initial distribution
$\pi = \tfrac{1}{2}\pi_1 + \tfrac{1}{2}\pi_2$ we get the probability

$$P = \tfrac{1}{2} P_1 + \tfrac{1}{2} P_2,$$

which is again ergodic, by the above argument. But

$$P(C) = \tfrac{1}{2} P_1(C) + \tfrac{1}{2} P_2(C) = \tfrac{1}{2}. \quad \text{Q.E.D.}$$


What has been left is the problem: When does there exist a $\pi(A)$ such
that $p^{(n)}(A \mid x) \to \pi(A)$? If $F$ is countable, this problem has a
complete solution. In the case of general state spaces $F$, it is difficult
to arrive at satisfactory conditions (see Doob, Chapter 6). But if a
stationary initial distribution $\pi(A)$ exists, then under the following
conditions:

1) the state space $F$ is indecomposable under $p(A \mid x)$;

2) the motion is nonperiodic; that is, $F$ is indecomposable under the
transition probabilities $p^{(n)}(A \mid x)$, $n = 2, 3, \ldots$;

3) for each $x \in F$, $p(A \mid x) \ll \pi(A)$;

Doob [35] has shown that

Theorem 7.18. $\lim_n p^{(n)}(A \mid x) = \pi(A)$ for all
$A \in \mathcal{B}_1(F)$, $x \in F$.

The proof is essentially an application of the ergodic theorem and its
refinements. [As shown in Doob’s paper, (3) can be weakened somewhat.]


4. THE COUNTABLE CASE 


The case where the state space $F$ is countable is much easier to understand
and analyze. It also gives some insight into the behavior of general Markov
chains. Hence assume that $F$ is a subset of the integers, that we have
transition probabilities $p(j \mid k)$, satisfying

(7.19)    $$p(j \mid k) \ge 0, \qquad \sum_j p(j \mid k) = 1,$$

where the summation is over all states in $F$, and n-step transition
probabilities $p^{(n)}(j \mid k)$ defined by

$$p^{(n)}(j \mid k) = \sum_i p^{(n-1)}(j \mid i)\, p(i \mid k).$$

This is exactly matrix multiplication: Denote the matrix $\{p(j \mid k)\}$
by $P$; then the n-step transition probabilities are the elements of $P^n$.
Therefore, if $F$ has a finite number of states the asymptotic stationarity
problem can be studied in terms of what happens to the elements of a matrix
as it is raised to higher and higher powers. The theory in this case is
complete and detailed. (See Feller [59, Vol. I, Chapter 16].) The idea that
simplifies the theory in the countable case is the renewal concept. That is,
if a Markov chain starts in state $j$, then every time it returns to state
$j$, the whole process starts over again as from the beginning.
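
The matrix-power description can be sketched for a finite chain. The
two-state matrix below is a hypothetical example, stored row-wise so that
`P[k][j]` stands for $p(j \mid k)$ (an indexing choice made here, not fixed
by the text), and `n_step` is an illustrative helper. Exact rational
arithmetic makes the semigroup identity $P^{n+m} = P^n P^m$ checkable with
`==`.

```python
from fractions import Fraction as Fr

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def n_step(P, n):
    """n-step transition matrix P^n (n >= 1) by repeated multiplication."""
    result = P
    for _ in range(n - 1):
        result = mat_mul(result, P)
    return result

# Hypothetical two-state chain; each row sums to one as in (7.19).
P = [[Fr(9, 10), Fr(1, 10)],
     [Fr(1, 2), Fr(1, 2)]]

# Semigroup property of the n-step transition probabilities.
assert n_step(P, 5) == mat_mul(n_step(P, 2), n_step(P, 3))

# Both rows of a high power are close to the same limit (5/6, 1/6).
P30 = n_step(P, 30)
assert abs(float(P30[0][0]) - 5 / 6) < 1e-9
assert abs(float(P30[1][0]) - 5 / 6) < 1e-9
```

For this matrix the limit (5/6, 1/6) is also the solution of the stationary
equations, which is the behavior the later convergence theorem describes.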


5. THE RENEWAL PROCESS OF A STATE 


Let $X_0, X_1, \ldots$ be the Markov chain starting from state $j$. We
ignore transitions to all other states and focus attention only on the
returns of the process to the state $j$. Define random variables
$U_1, U_2, \ldots$ by

$$U_n = \begin{cases} 1, & \text{if } X_n = j, \\ 0, & \text{otherwise.} \end{cases}$$

By the Markov property and stationarity of the transition probabilities,
for any $B \in \mathcal{B}_\infty(\{0, 1\})$,
$(s_1, \ldots, s_{n-1}) \in \{0, 1\}^{(n-1)}$,

(7.20)    $$P((U_{n+1}, \ldots) \in B \mid U_n = 1, U_{n-1} = s_{n-1}, \ldots, U_1 = s_1) = P((U_1, \ldots) \in B).$$

This simple relationship partially summarizes the fact that once the
process returns to $j$, the process starts anew. In general, any process
taking values in $\{0, 1\}$ and satisfying (7.20) is called a renewal
process. We study the behavior at a single state $j$ by looking at the
associated process $U_1, U_2, \ldots$ governing returns to $j$. Define the
event $G$ that a return to state $j$ occurs at least once by

$$G = \bigcup_{n=1}^{\infty} \{X_n = j\}.$$


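Theorem 7.21. The following dichotomy is in force:

$$P_j(G) = 1 \Longleftrightarrow \begin{cases} P_j(X_n = j \text{ i.o.}) = 1, \\ \sum_{1}^{\infty} P_j(X_n = j) = \infty, \end{cases}$$

$$P_j(G) < 1 \Longleftrightarrow \begin{cases} P_j(X_n = j \text{ i.o.}) = 0, \\ \sum_{1}^{\infty} P_j(X_n = j) < \infty. \end{cases}$$

Proof. Let $F_n = \{U_n = 1,\ U_{n+k} = 0, \text{ all } k \ge 1\}$,
$n \ge 1$; $F_0 = \{U_k = 0, \text{ all } k \ge 1\}$. Thus $F_n$ is the
event that the last return to $j$ occurs at time $n$. Hence

$$\{X_n = j \text{ i.o.}\}^c = \bigcup_{n=0}^{\infty} F_n.$$

The $F_n$ are disjoint, so

$$1 - P_j(X_n = j \text{ i.o.}) = \sum_{n=0}^{\infty} P_j(F_n),$$

and by (7.20),

(7.22)    $$P_j(F_n) = P_j(U_{n+k} = 0, \text{ all } k \ge 1 \mid U_n = 1)\, P_j(U_n = 1) = P_j(F_0)\, P_j(U_n = 1).$$

According to the definitions, $F_0 = G^c$, hence

$$P_j(G) = 1 \Rightarrow P_j(F_0) = 0 \Rightarrow P_j(F_n) = 0$$

for all $n \ge 0$; so $P_j(X_n = j \text{ i.o.}) = 1$. Then

$$\sum_{1}^{\infty} P_j(X_n = j) = \infty;$$

otherwise, the Borel-Cantelli lemma would imply
$P_j(X_n = j \text{ i.o.}) = 0$. If $P_j(G) < 1$, then $P_j(F_0) > 0$, and
we can use expression (7.22) to substitute for $P_j(F_n)$, getting

(7.23)    $$1 - P_j(X_n = j \text{ i.o.}) = P_j(F_0) + P_j(F_0) \sum_{1}^{\infty} P_j(X_n = j).$$

This implies $\sum_{1}^{\infty} P_j(X_n = j) < \infty$ and thus
$P_j(X_n = j \text{ i.o.}) = 0$.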


Definition 7.24. Call the state $j$ recurrent if
$P_j(X_n = j \text{ i.o.}) = 1$, transient if
$P_j(X_n = j \text{ i.o.}) = 0$.

Note that 7.21 in terms of transition probabilities reads, “$j$ is recurrent
iff $\sum_{1}^{\infty} p^{(n)}(j \mid j) = \infty$.”
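
As a numerical sketch of the dichotomy, take coin-tossing with success
probability $p$ and $j = 0$, and assume the standard formula
$P_0(S_{2n} = 0) = \binom{2n}{n}(pq)^n$ for the return probabilities (the
formula and the function names are brought in for illustration and are not
stated in this section). For a biased coin the sum converges and, by (7.23)
with $P_0(X_n = 0 \text{ i.o.}) = 0$, the probability of no return is
$1/(1 + \sum_n P_0(X_n = 0))$; for a fair coin the partial sums keep
growing.

```python
def return_probability_sum(p, terms=2000):
    """Partial sum over n = 1..terms of P_0(S_2n = 0) = C(2n, n) (p q)^n."""
    q = 1.0 - p
    total, term = 0.0, 1.0
    for n in range(1, terms + 1):
        # C(2n, n) / C(2n - 2, n - 1) = (2n)(2n - 1) / n^2
        term *= (2 * n) * (2 * n - 1) / (n * n) * p * q
        total += term
    return total

def no_return_probability(p, terms=2000):
    """P_0(F_0) obtained from (7.23) in the transient case."""
    return 1.0 / (1.0 + return_probability_sum(p, terms))

# Biased coin: the sum is finite and the escape probability is |p - q|.
assert abs(no_return_probability(0.7) - 0.4) < 1e-9

# Fair coin: the partial sums keep increasing with the number of terms.
assert return_probability_sum(0.5, 500) < return_probability_sum(0.5, 2000)
```

The value 0.4 for $p = 0.7$ agrees with the expression $|p - q|$ that
appears in the problems at the end of this section.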

Define $R_n$ as the time of the nth return to $j$, and the times between
returns as $T_1 = R_1$, $T_k = R_k - R_{k-1}$, $k > 1$. Then

Proposition 7.25. If $j$ is recurrent, then $T_1, T_2, \ldots$ are
independent, identically distributed random variables under the probability
$P_j$.

Proof. $T_1$ is a Markov time for $X_0, X_1, \ldots$ By the strong Markov
property 7.8,

$$P_j((X_{T_1+1}, X_{T_1+2}, \ldots) \in B \mid X_n, n \le T_1) = P_j((X_1, X_2, \ldots) \in B).$$

Therefore, the process $(X_{T_1+1}, \ldots)$ has the same distribution as
$(X_1, \ldots)$ and is independent of $\mathcal{F}(X_n, n \le T_1)$, hence
is independent of $T_1$.

The $T_1, T_2, \ldots$ are also called the recurrence times for the state
$j$. The result of 7.25 obviously holds for any renewal process with
$T_1, T_2, \ldots$ the times between successive ones in the
$U_1, U_2, \ldots$ sequence.

Definition 7.26. Call the recurrent state $j$ positive-recurrent if
$E_j T_1 < \infty$, null-recurrent if $E_j T_1 = \infty$.

Definition 7.27. The state $j$ has period $d \ge 1$ if $T_1$ is distributed
on the lattice $L_d$, $d \ge 1$, under $P_j$. If $d > 1$, call the state
periodic; if $d = 1$, nonperiodic. [Recall from (3.32) that
$L_d = \{nd\}$, $n = 0, \pm 1, \ldots$]
For $j$ recurrent, let the random vectors $Z_k$ be defined by

$$Z_k = (X_{R_k}, \ldots, X_{R_{k+1}-1})$$

(with $R_0 = 0$). Then $Z_k$ takes values in the space $R$ of all finite
sequences of integers with $\mathcal{B}$ the smallest σ-field containing all
sets of the form

$$\{(x_1, \ldots, x_m) \in R;\ x_k = i\}, \quad k \le m,$$

where $i$ is any integer. Since the length of blocks is now variable, an
interesting generalization of 7.25 is

Theorem 7.28. The $Z_0, Z_1, \ldots$ are independent and identically
distributed random vectors under the probability $P_j$.

Proof. This follows from 7.8 by seeing that $T_1$ is a Markov time. $Z_0$
is measurable $\mathcal{F}(X_n, n \le T_1)$ because for
$B \in \mathcal{B}$,

$$\{Z_0 \in B\} \cap \{T_1 = k\} = \{(X_0, \ldots, X_{k-1}) \in B,\ T_1 = k\} \in \mathcal{F}(X_0, \ldots, X_n), \quad k \le n.$$

By 7.8, for $A \in \mathcal{B}_\infty$,

$$P_j((X_{T_1}, X_{T_1+1}, \ldots) \in A,\ Z_0 \in B) = P_j((X_0, X_1, \ldots) \in A)\, P_j(Z_0 \in B).$$

Now $Z_1$ is the same function of the $X_{T_1}, X_{T_1+1}, \ldots$ process
as $Z_0$ is of the $X_0, X_1, \ldots$ process. Hence $Z_1$ is independent of
$Z_0$ and has the same distribution.

Call events $\{A_n\}$ a renewal event $\mathcal{E}$ if the random variables
$\chi_{A_n}$ form a renewal process. Problems 3 and 4 are concerned with
these.


Problems 


3. (Runs of length at least N). Consider coin-tossing (biased or fair) and
define

$$A_n = \{\omega_n = T,\ \omega_{n-1} = H, \ldots, \omega_{n-N} = H\}.$$

Prove that $\{A_n\}$ form a renewal event. [Note that $A_n$ is the event
such that at time $n$ a run of at least $N$ heads has just finished.]

4. Given any sequence $t_N$, $N$ long, of H, T, let
$(\Omega, \mathcal{F}, P)$ be the coin-tossing game (fair or biased). Define
$A_n = \{\omega;\ (\omega_n, \ldots, \omega_{n-N+1}) = t_N\}$,
($A_n = \emptyset$ if $n < N$). Find necessary and sufficient conditions on
$t_N$ for the $\{A_n\}$ to form a renewal event.

5. If $P_j(X_n = j \text{ i.o.}) = 0$, then show that

$$P_j(X_n = j \text{ never occurs}) = \frac{1}{1 + \sum_{1}^{\infty} p^{(n)}(j \mid j)}.$$

6. Use Problem 5 to show that for a biased coin

$$P(\text{no equalizations}) = |p - q|.$$

7. Use Theorem 7.21 to show that if $\{Z_n\}$ are the successive fortunes in
a fair coin-tossing game,

$$P(Z_n = 0 \text{ i.o.}) = 1.$$


6. GROUP PROPERTIES OF STATES 


Definition 7.29. If there is an $n$ such that $p^{(n)}(k \mid j) > 0$, $j$
to $k$ (denoted by $j \to k$) is a permissible transition. If $j \to k$ and
$k \to j$, say that $j$ and $k$ communicate, and write
$j \leftrightarrow k$.

Communicating states share properties:

Theorem 7.30. If $j \leftrightarrow k$, then $j$ and $k$ are simultaneously
transient, null-recurrent, or positive-recurrent, and have the same period.

Proof. Use the martingale result, Problem 9, Chapter 5, to deduce that
under any initial distribution the sets $\{X_n = j \text{ i.o.}\}$ and
$\{X_n = k \text{ i.o.}\}$ have the same probability. So both are recurrent
or transient together. Let $T_1, T_2, \ldots$ be the recurrence times for
$j$ starting from $j$, and assume $E_j T_1 < \infty$. Let $V_n = 1$ or 0 as
there was or was not a visit to state $k$ between times $R_{n-1}$ and
$R_n$. The $V_n$ are independent and identically distributed (7.28) with
$P(V_n = 1) > 0$. Denote by $n^*$ the first $n$ such that $V_n = 1$, $T$
the time of the first visit to state $k$. Then

$$E_j T \le \sum_{n=1}^{\infty} \int_{\{n^* = n\}} R_n\, dP_j = \sum_{n=1}^{\infty} \int_{\{n^* \ge n\}} T_n\, dP_j.$$

But $\{n^* \ge n\} = \{n^* < n\}^c$, and
$\{n^* < n\} \in \mathcal{F}(X_k, k \le R_{n-1})$, thus is independent of
$T_n$. Hence

$$E_j T \le E_j T_1 \cdot \sum_{1}^{\infty} P_j(n^* \ge n) = E_j T_1 \cdot E_j n^*.$$

Obviously, $E_j n^* < \infty$, so $E_j T < \infty$. Once $k$ has occurred,
the time until another occurrence of $k$ is less than the time to get back
to state $j$ plus the time until another occurrence of $k$ starting from
$j$. This latter time has the same distribution as $T$. The former time
must have finite expectation, otherwise $E_j T_1 = \infty$. Hence $k$ has a
recurrence time with finite expectation.




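Take $n_1, n_2$ such that $p^{(n_1)}(k \mid j) > 0$,
$p^{(n_2)}(j \mid k) > 0$. If $T^{(k)}$ is the first return time to $k$,
define $J_k = \{n;\ P_k(T^{(k)} = n) > 0\}$; $L_{d_1}$ is the smallest
lattice containing $J_k$, $L_{d_2}$ the smallest lattice containing $J_j$.
Diagrammatically, we can go from $k$ to $j$ in $n_2$ steps, from $j$ back
to $j$ in $m$ steps, and from $j$ to $k$ in $n_1$ steps.

That is, if $m \in J_j$, then $n_1 + m + n_2 \in J_k$, or
$J_j + n_1 + n_2 \subset J_k$. Hence $L_{d_2} \subset L_{d_1}$ and
$d_2 \ge d_1$. The converse argument gives $d_1 \ge d_2$, so $d_1 = d_2$.
Q.E.D.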


Let $J$ be the set of recurrent states. Communication ($\leftrightarrow$)
is clearly an equivalence relationship on $J$, so splits $J$ into disjoint
classes $C_1, C_2, \ldots$

Proposition 7.31. For any $C_i$,

$$\sum_{k \in C_i} p(k \mid j) = 1, \quad \text{all } j \in C_i,$$

that is, each $C_i$ is a closed set of states.

Proof. If $j$ is recurrent and $j \to k$, then $k \to j$. Otherwise,
$P_j(X_n = j \text{ i.o.}) < 1$. Hence the set of states
$\{k;\ j \to k\} = \{k;\ j \leftrightarrow k\}$, but this latter is exactly
the equivalence class containing $j$. The sum of $p(k \mid j)$ over all $k$
such that $j \to k$ is clearly one.


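Take $C$ to be a closed indecomposable set of recurrent states. They all
have the same period $d$. If $d > 1$, define the relationship
$\overset{d}{\leftrightarrow}$ as $j \overset{d}{\leftrightarrow} k$ if
$p^{(n_1 d)}(k \mid j) > 0$, $p^{(n_2 d)}(j \mid k) > 0$ for some
$n_1, n_2$. Since $j \overset{d}{\leftrightarrow} k$ is an equivalence
relationship, $C$ may be decomposed into disjoint sets $D_1, D_2, \ldots$
under $\overset{d}{\leftrightarrow}$.

Proposition 7.32. There are $d$ disjoint equivalence classes
$D_1, D_2, \ldots, D_d$ under $\overset{d}{\leftrightarrow}$ and they can
be numbered such that

$$\sum_{k \in D_{i+1[d]}} p(k \mid j) = 1, \quad j \in D_i,$$

or diagrammatically,

$$D_1 \to D_2 \to \cdots \to D_d \to D_1.$$

The $D_1, \ldots, D_d$ are called cyclically moving subsets.

Proof. Denote by $j \xrightarrow{l[d]} k$ the existence of an $n$ such that
$n[d] = l[d]$ and $p^{(n)}(k \mid j) > 0$. Fix $j_1$ and number $D_1$ as
the equivalence class containing $j_1$. Take $j_2$ such that
$p(j_2 \mid j_1) > 0$, and number $D_2$ the equivalence class containing
$j_2$, and so on. See that $j \xrightarrow{l[d]} k$ implies that
$k \xrightarrow{(d-l)[d]} j$. But

$$p^{(d)}(j_{d+1} \mid j_1) \ge p(j_{d+1} \mid j_d) \cdots p(j_2 \mid j_1) > 0,$$

so $j_1 \overset{d}{\leftrightarrow} j_{d+1} \Rightarrow D_{d+1} = D_1$.
Consider any state $k$ such that $p(k \mid j) > 0$, $j \in D_1$ and
$k \notin D_2$, say $k \in D_3$. Then $k \overset{d}{\leftrightarrow} j_3$.
Look at the string

$$j_1 \xrightarrow{0[d]} j \xrightarrow{1[d]} k \xrightarrow{0[d]} j_3 \xrightarrow{1[d]} \cdots \xrightarrow{1[d]} j_{d+1} \xrightarrow{0[d]} j_1.$$

This string leads to $j_1 \xrightarrow{(d-1)[d]} j_1$. From this
contradiction, conclude $k \in D_2$, and

$$\sum_{k \in D_2} p(k \mid j) = 1, \quad j \in D_1.$$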
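
For a finite chain the period and the cyclically moving subsets can be
computed from the pattern of positive transition probabilities. The sketch
below uses a standard breadth-first-search technique (not the argument of
the proof above): the period is the gcd of `level[j] + 1 - level[k]` over
all permissible one-step transitions $j \to k$. The function name and both
example matrices are hypothetical, classes are numbered from 0 rather than
1, and `P[j][k]` stands for $p(k \mid j)$; the chain is assumed to consist
of a single communicating class.

```python
from math import gcd

def period_and_cyclic_classes(P, start=0):
    """Period d and cyclic classes D_0, ..., D_{d-1} of an irreducible chain."""
    n = len(P)
    level = {start: 0}
    frontier = [start]
    while frontier:                          # breadth-first search from start
        nxt = []
        for j in frontier:
            for k in range(n):
                if P[j][k] > 0 and k not in level:
                    level[k] = level[j] + 1
                    nxt.append(k)
        frontier = nxt
    d = 0
    for j in level:                          # gcd over all one-step transitions
        for k in range(n):
            if P[j][k] > 0:
                d = gcd(d, level[j] + 1 - level[k])
    classes = [sorted(k for k in level if level[k] % d == r) for r in range(d)]
    return d, classes

# Hypothetical walk on a cycle of four states, moving +1 or -1 with prob. 1/2.
walk = [[0, .5, 0, .5], [.5, 0, .5, 0], [0, .5, 0, .5], [.5, 0, .5, 0]]
d, classes = period_and_cyclic_classes(walk)
assert (d, classes) == (2, [[0, 2], [1, 3]])

# All mass leaving D_r lands in D_{r+1 mod d}, as in Proposition 7.32.
for r in range(d):
    for j in classes[r]:
        assert sum(walk[j][k] for k in classes[(r + 1) % d]) == 1.0
```

Allowing the particle to stay in place with positive probability makes the
same routine return period 1 and a single class.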

If $C$ is a closed set of communicating nonperiodic states, one useful
result is that for any two states $j, k$, all other states are common
descendants. That is:

Proposition 7.33. For $j, k, l$ any three states, there exists an $n$ such
that $p^{(n)}(l \mid j) > 0$, $p^{(n)}(l \mid k) > 0$.

Proof. Take $n_1, n_2$ such that $p^{(n_1)}(l \mid j) > 0$,
$p^{(n_2)}(l \mid k) > 0$. Consider the set
$J = \{n;\ p^{(n)}(l \mid l) > 0\}$. By the nonperiodicity, the smallest
lattice containing $J$ is $L_1$. Under addition, $J$ is closed, so every
integer can be expressed as $s_1 m_1 - s_2 m_2$, $m_1, m_2 \in J$,
$s_1, s_2$ nonnegative integers. Take $n_2 - n_1 = s_1 m_1 - s_2 m_2$, so
$n_1 + s_1 m_1 = n_2 + s_2 m_2 = n$. Now check that
$p^{(n)}(l \mid j) > 0$, $p^{(n)}(l \mid k) > 0$.


Problems 


8. If the set of states is finite, prove that 


a) there must be at least one recurrent state; 

b) every recurrent state is positive recurrent; 

c) there is a random variable $n^*$ such that for $n > n^*$, $X_n$ is in the
set of recurrent states.


9. Give an example of a chain with all states transient. 


7. STATIONARY INITIAL DISTRIBUTIONS 

Consider a chain $X_0, X_1, \ldots$ starting from an initial state $i$ such
that $i$ is recurrent. Let $N_n(j)$ be the number of visits of
$X_1, \ldots, X_n$ to $j$, and $\bar\pi(j)$ be the expected number of
visits to $j$ before return to $i$, $\bar\pi(i) = 1$. The proof of Theorem
3.45 goes through, word for word. Use (3.46) again to conclude

$$\frac{N_n(j)}{N_n(k)} \to \frac{\bar\pi(j)}{\bar\pi(k)} \quad \text{a.s.}$$

The relevance here is

Proposition 7.34.

$$\bar\pi(k) = \sum_j p(k \mid j)\, \bar\pi(j)$$

for all $k$ such that $i \to k$.

Proof. A visit to state $j$ occurs on the nth trial before return to state
$i$ if $\{X_n = j,\ X_{n-1} \ne i, \ldots, X_1 \ne i\}$. Therefore

$$\bar\pi(j) = \sum_{n=1}^{\infty} P_i(X_n = j,\ X_{n-1} \ne i, \ldots, X_1 \ne i), \quad j \ne i,$$

so that

$$\sum_j p(k \mid j)\, \bar\pi(j) = \sum_{n=1}^{\infty} \sum_{j \ne i} P_i(X_{n+1} = k \mid X_n = j)\, P_i(X_n = j,\ X_{n-1} \ne i, \ldots, X_1 \ne i) + p(k \mid i)$$
$$= \sum_{n=1}^{\infty} P_i(X_{n+1} = k,\ X_n \ne i,\ X_{n-1} \ne i, \ldots, X_1 \ne i) + P_i(X_1 = k)$$
$$= \bar\pi(k), \quad k \ne i.$$

For $k = i$,
$P_i(X_{n+1} = i,\ X_n \ne i, \ldots, X_1 \ne i) = P_i(T^{(i)} = n + 1)$,
and the right-hand side becomes $\sum_{1}^{\infty} P_i(T^{(i)} = n) = 1$,
where $T^{(i)}$ is the time of first recurrence of state $i$.

The expected visit counts $\{\bar\pi(j)\}$, therefore, form a stationary
initial measure for the chain. By summing, we get

$$\sum_{j \ne i} \bar\pi(j) = \sum_{n=1}^{\infty} P_i(X_n \ne i, \ldots, X_1 \ne i) = \sum_{n=1}^{\infty} P_i(T^{(i)} > n) = E_i T^{(i)} - 1,$$

or

$$\sum_j \bar\pi(j) = E_i T^{(i)}.$$

If $i$ is positive-recurrent, then $\pi(j) = \bar\pi(j)/E_i T^{(i)}$ forms
a stationary initial distribution for the process. By Proposition 6.38, if
$T^{(j)}$ is the first recurrence time for state $j$, starting from $j$,
then

$$\pi(j) = \frac{1}{E_j T^{(j)}}.$$

Every equivalence class $C$ of positive-recurrent states thus has the
unique stationary initial distribution given by $\pi(j)$. Note that
$\pi(j) > 0$ for all $j$ in the class. Use the ergodic theorem to conclude

$$\frac{1}{N} \sum_{m=1}^{N} p^{(m)}(j \mid k) \to \pi(j), \quad \text{all } k \in C.$$

Restrict the state space to such a class $C$.


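Proposition 7.35. Let $\lambda(j)$ be a solution of

$$\lambda(j) = \sum_k p(j \mid k)\, \lambda(k)$$

such that $\sum_j |\lambda(j)| < \infty$. Then there is a constant $c$ such
that $\lambda(j) = c\,\pi(j)$.

Proof. By iteration

$$\lambda(j) = \sum_k p^{(m)}(j \mid k)\, \lambda(k), \quad m \ge 1.$$

Consequently,

$$\lambda(j) = \sum_k \left( \frac{1}{n} \sum_{m=1}^{n} p^{(m)}(j \mid k) \right) \lambda(k).$$

The inner term converges to $\pi(j)$, and is always $\le 1$. Use the bounded
convergence theorem to get

$$\lambda(j) = \pi(j) \sum_k \lambda(k).$$

This proposition is useful in that it permits us to get the $\pi(j)$ by
solving the system

$$\pi(j) = \sum_k p(j \mid k)\, \pi(k).$$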
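
A minimal sketch of this recipe for a finite class: solve the linear system
together with the normalization $\sum_j \pi(j) = 1$, then compare with the
mean recurrence times. The three-state matrix and the helper names `solve`,
`stationary` and `mean_return_time` are hypothetical; `P[k][j]` stands for
$p(j \mid k)$, and the mean return time is obtained by the usual first-step
equations, which are an assumption of this illustration rather than
something derived in the text.

```python
from fractions import Fraction as Fr

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A invertible)."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def stationary(P):
    """Solve pi(j) = sum_k p(j | k) pi(k) with sum_j pi(j) = 1."""
    n = len(P)
    # Balance equations for j = 0..n-2, then the normalization row.
    A = [[P[k][j] - (1 if j == k else 0) for k in range(n)]
         for j in range(n - 1)]
    A.append([Fr(1)] * n)
    return solve(A, [Fr(0)] * (n - 1) + [Fr(1)])

def mean_return_time(P, j):
    """E_j T^(j) via h(k) = 1 + sum_{l != j} p(l | k) h(l) for k != j."""
    others = [k for k in range(len(P)) if k != j]
    A = [[(1 if k == l else 0) - P[k][l] for l in others] for k in others]
    h = dict(zip(others, solve(A, [Fr(1)] * len(others))))
    return 1 + sum(P[j][l] * h[l] for l in others)

P = [[Fr(1, 2), Fr(1, 2), Fr(0)],
     [Fr(1, 4), Fr(1, 2), Fr(1, 4)],
     [Fr(0), Fr(1, 2), Fr(1, 2)]]

pi = stationary(P)
assert pi == [Fr(1, 4), Fr(1, 2), Fr(1, 4)]
assert all(mean_return_time(P, j) == 1 / pi[j] for j in range(3))
```

The last assertion checks the relation $\pi(j) = 1/E_j T^{(j)}$ for each
state of this example.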


8. SOME EXAMPLES 


Example A. Simple symmetric random walks. Denote by $I$ the integers, and
by $I^{(m)}$ the space $j = (j_1, \ldots, j_m)$ of all m-tuples of
integers. If the particle is at $j$, then it makes the transition to any
one of the $2m$ nearest neighbors $(j_1 \pm 1, j_2, \ldots, j_m)$,
$(j_1, j_2 \pm 1, \ldots, j_m)$, ... with probability $1/2m$. The
distribution of this process starting from $j$ is given by
$j + Y_1 + \cdots + Y_n$, where $Y_1, Y_2, \ldots$ are independent
identically distributed random vectors taking values in
$(\pm 1, 0, \ldots, 0)$, $(0, \pm 1, 0, \ldots, 0)$, ... with equal
probabilities. All states communicate. The chain has period $d = 2$.
Denote $0 = (0, 0, \ldots, 0)$. For $m = 1$, the process starting from 0 is
the fair coin-tossing game. Thus for $m = 1$, all states are
null-recurrent. Polya [116] discovered the interesting phenomenon that if
$m = 2$, all states are again null-recurrent, but for $m \ge 3$ all states
are transient. In fact, every random walk on $I^{(m)}$ that is genuinely
m-dimensional and has mean-zero steps of finite variance is null-recurrent
for $m \le 2$, and every genuinely m-dimensional walk is transient for
$m \ge 3$. See Chung and Fuchs [18].

Example B. The renewal chain. Another way of looking at a renewal process
which illuminates the use of the word renewal is the following: At time
zero, a new light bulb is placed in a fixture. Let $T_1$ be the number of
periods (integral) that it lasts. When it blows out at time $n$, it is
replaced by another light bulb starting at time $n$ that lasts $T_2$
periods; the kth light bulb lasts $T_k$ periods. The light bulbs are of
identical manufacture and a reasonable model is to assume that the
$T_1, T_2, \ldots$ are independent and identically distributed random
variables. Also assume each bulb lasts at least one period; that is,
$P(T_1 \ge 1) = 1$. Let $A_n$ be the event that a light bulb blows out at
time $n$. Intuitively, this starts the whole process over again.
Mathematically, it is easy to show that $\{A_n\}$ form a renewal event.
Formally,
$A_n = \{\omega;\ \exists \text{ a } k \text{ such that } T_1 + \cdots + T_k = n\}$.
Now the point is that 7.25 shows that the converse is true; given any
renewal process $U_1, U_2, \ldots$, and letting $T_1, T_2, \ldots$ be the
times between occurrences of $\{U_k = 1\}$, we find that if
$P(T_1 < \infty) = 1$, then

$$\{U_n = 1\} = \{\omega;\ \exists \text{ a } k \text{ such that } T_1 + \cdots + T_k = n\},$$

where the $T_1, T_2, \ldots$ are independent and identically distributed.

For a Markov chain $X_0, X_1, \ldots$ starting from state $j$, the events
$\{X_n = j\}$ form a renewal event. Now we ask the converse question: given
a renewal process $U_1, U_2, \ldots$, is there a Markov process
$X_0, X_1, \ldots$ starting from the origin such that the process
$\hat{U}_1, \hat{U}_2, \ldots$, defined by

$$\hat{U}_n = \begin{cases} 1, & X_n = 0, \\ 0, & \text{otherwise,} \end{cases}$$

has the same distribution as $U_1, U_2, \ldots$? Actually, we can define a
Markov process $X_0, X_1, \ldots$ on the same sample space as the renewal
process such that

$$\{U_n = 1\} = \{X_n = 0\}.$$


Definition 7.36. For a renewal process $U_1, U_2, \ldots$, add the
convention $R_0 = 0$, and define the time of the last replacement prior to
time $n$ as

$$\tau_n = R_k \quad \text{on the set } \{\omega;\ R_k \le n < R_{k+1}\}.$$

The age of the current item is defined by

$$X_n = n - \tau_n.$$

Clearly,

$$\{X_n = 0\} = \{\tau_n = n\} = \{\exists \text{ a } k \text{ such that } R_k = n\} = \{U_n = 1\}.$$

The $X_n$ process, as defined, takes values in the nonnegative integers, and
$X_0 = 0$. What does it look like? If $X_n = j$, then $\tau_n = n - j$ and
on this set there exists no $k$ such that $n - j < R_k \le n$. Therefore,
on the set $\tau_n = n - j$, either $\tau_{n+1} = n - j$ or
$\tau_{n+1} = n + 1$. So if $X_n = j$, either $X_{n+1} = j + 1$ or
$X_{n+1} = 0$. Intuitively, either a renewal takes place at time $n + 1$ or
the item ages one more time unit. Clearly, $X_0, X_1, \ldots$ is a Markov
chain with the stationary transition probabilities

$$P(X_{n+1} = j + 1 \mid X_n = j) = P(\tau_{n+1} = n - j \mid \tau_n = n - j)$$
$$= P(T_1 > j + 1 \mid T_1 > j)$$
$$= \frac{P(T_1 > j + 1)}{P(T_1 > j)}.$$

All states communicate, the chain has period determined by the minimum
lattice on which $T_1$ is distributed, and is null-recurrent or
positive-recurrent as $ET_1 = \infty$ or $< \infty$.

If $ET_1 < \infty$, there is a stationary point process
$U_1^*, U_2^*, \ldots$ having interpoint distances $T_1^*, T_2^*, \ldots$
with the same distribution as $T_1, T_2, \ldots$ For this process use
$R_k^*$ as the time of the kth point past the origin, and define
$\tau_n^*$, $X_n^*$ as above. The transition probabilities of $X_n^*$ are
the same as for $X_n$, but $X_n^*$ is easily seen to be stationary. At
$n = 0$, the age of the current item is $k$ on the set
$\{U_0^* = 0, \ldots, U_{-k+1}^* = 0,\ U_{-k}^* = 1\}$. The probability of
this set is, by stationarity,

$$P^*(U_{k+1}^* = 0, \ldots, U_2^* = 0,\ U_1^* = 1) = P(T_1 > k)/ET_1.$$

Hence $\pi(k) = P(T_1 > k)/ET_1$ is a stationary initial distribution for
the process. The question of asymptotic stationarity of
$X_0, X_1, \ldots$ is equivalent to asking if the renewal process is
asymptotically stationary in the sense

$$P((U_n, U_{n+1}, \ldots) \in B) \to P^*((U_1^*, \ldots) \in B)$$

for every $B \in \mathcal{B}_\infty(\{0, 1\})$.
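
The age chain and its stationary distribution can be written out for a
lifetime with finitely many values. This is a sketch under stated
assumptions: the lifetime distribution in the usage lines is hypothetical,
`age_chain` is an illustrative name, and `P[j][k]` stands for
$p(k \mid j)$. The code builds the transition probabilities
$P(T_1 > j+1)/P(T_1 > j)$ and checks that $\pi(k) = P(T_1 > k)/ET_1$
satisfies the stationary equations exactly.

```python
from fractions import Fraction as Fr

def age_chain(lifetime_pmf):
    """lifetime_pmf[t] = P(T_1 = t) for t = 1..m; returns (P, pi) on ages 0..m-1."""
    m = max(lifetime_pmf)
    # survival[k] = P(T_1 > k), k = 0..m (survival[m] is zero)
    survival = [sum(p for t, p in lifetime_pmf.items() if t > k)
                for k in range(m + 1)]
    P = [[Fr(0)] * m for _ in range(m)]
    for j in range(m):
        age_on = survival[j + 1] / survival[j]    # P(T_1 > j+1 | T_1 > j)
        if j + 1 < m:
            P[j][j + 1] = age_on                  # item ages one more unit
        P[j][0] += 1 - age_on                     # renewal: age returns to 0
    mean = sum(t * p for t, p in lifetime_pmf.items())
    pi = [survival[k] / mean for k in range(m)]
    return P, pi

# Hypothetical lifetime: one, two or three periods.
P, pi = age_chain({1: Fr(1, 2), 2: Fr(1, 4), 3: Fr(1, 4)})

assert pi == [Fr(4, 7), Fr(2, 7), Fr(1, 7)]
# Stationarity: pi(k) = sum_j p(k | j) pi(j) for every age k.
assert all(sum(pi[j] * P[j][k] for j in range(3)) == pi[k] for k in range(3))
```

Here $ET_1 = 7/4$, so the stationary probability of age 0 is $4/7$, the
reciprocal of the mean lifetime.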


Problem 10. Show that a sufficient condition for the $U_1, U_2, \ldots$
process to be asymptotically stationary in the above sense is

$$P(U_n = 1) \to \frac{1}{ET_1}.$$
Example C. Birth and death processes. These are a class of Markov chains
in which the state space is the integers $I$ or the nonnegative integers
$I^+$ and where, if the particle is at $j$, it can move either to $j + 1$
with probability $\alpha_j$, to $j - 1$ with probability $\beta_j$, or
remain at $j$ with probability $1 - \alpha_j - \beta_j$. If the states are
$I$, assume all states communicate. If the states are $I^+$, 0 is either an
absorbing state (defined as any state $i$ such that $p(i \mid i) = 1$) or
reflecting ($p(1 \mid 0) > 0$). Assume that all other states communicate
between themselves, and can get to zero. Equivalently, $\alpha_j \ne 0$,
$\beta_j \ne 0$, for $j > 0$. If 0 is absorbing, then all other states are
transient, because $j \to 0$ but $0 \not\to j$, $j \ne 0$. Therefore, for
almost every sample path, either $X_n \to \infty$ or $X_n \to 0$. If 0 is
reflecting, the states can be transient or recurrent, either positive or
null.

To get a criterion, let $t_j^*$ be the first passage time to state $j$
starting from zero:

$$t_j^* = \min\{n;\ X_n = j\}, \quad X_0 = 0.$$

Let $A_j$ be the event that a return to zero occurs between $t_j^*$ and
$t_{j+1}^*$,

$$A_j = \bigcup_n \{X_n = 0,\ t_j^* < n < t_{j+1}^*\}.$$

The $t_j^*$ are Markov times, and we use 7.8 to conclude that the $A_j$ are
independent events. By the Borel-Cantelli lemma, the process is transient
or recurrent as $\sum_1^\infty P(A_j) < \infty$ or
$\sum_1^\infty P(A_j) = \infty$. Let $\tau^*$ be the first time after
$t_j^*$ that $X_n \ne j$. Then $P(A_j) = E(P(A_j \mid X_{\tau^*}))$. Now

$$P(A_j \mid X_{\tau^*} = j + 1) = 0.$$

On the set $X_{\tau^*} = j - 1$, $A_j$ can occur if we return to zero
before climbing to $j$ or by returning to zero only after climbing to $j$
but before climbing to $j + 1$. Since $\tau^*$ is a Markov time, by the
strong Markov property

$$P(A_j \mid X_{\tau^*} = j - 1) = P(A_{j-1}) + P(A_{j-1}^c)\, P(A_j), \quad j > 1.$$

Checking that $P(X_{\tau^*} = j - 1) = \beta_j/(\alpha_j + \beta_j)$ gives
the equation

$$P(A_j) = \frac{\beta_j}{\alpha_j + \beta_j} \left( P(A_{j-1}) + P(A_{j-1}^c)\, P(A_j) \right),$$

or

$$P(A_j) = \frac{r_j P(A_{j-1})}{1 + r_j P(A_{j-1})},$$

where $r_j = \beta_j/\alpha_j$. Direct substitution verifies that

$$P(A_j) = \frac{\rho_j}{s_j}, \quad \text{where } \rho_j = \prod_{k=1}^{j} r_k, \quad s_j = 1 + \sum_{k=1}^{j} \rho_k.$$

Certainly, if $\sum \rho_j < \infty$ then $\sum P(A_j) < \infty$. To go the
other way, note that since $s_j = s_{j-1}/(1 - P(A_j))$, then

$$\frac{\rho_j}{s_{j-1}} = \frac{P(A_j)}{1 - P(A_j)}.$$

But

$$\sum_{j=2}^{N} \frac{\rho_j}{s_{j-1}} \ge \sum_{j=2}^{N} \log\left(1 + \frac{\rho_j}{s_{j-1}}\right) = \log(s_N) - \log(s_1).$$

We have proved

Proposition 7.37. A birth and death process on $I^+$ with the origin
reflecting is transient iff

$$\sum_{j=1}^{\infty} \prod_{k=1}^{j} \left( \frac{\beta_k}{\alpha_k} \right) < \infty.$$

(Due to Harris [68]. See Karlin [86, p. 204] for an alternative derivation.)
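
The recursion for $P(A_j)$ and the criterion of Proposition 7.37 can be
checked on examples. The sketch below assumes the starting value
$P(A_1) = \beta_1/(\alpha_1 + \beta_1)$ and the closed form $\rho_j/s_j$
with $s_j = 1 + \sum_{k \le j} \rho_k$; both are this illustration's
reading of the derivation, and the rate functions and helper names are
hypothetical. With exact fractions the recursion and the closed form can be
compared term by term.

```python
from fractions import Fraction as Fr

def probs_by_recursion(alpha, beta, J):
    """P(A_j), j = 1..J, from P(A_j) = r_j a / (1 + r_j a), a = P(A_{j-1})."""
    probs = [beta(1) / (alpha(1) + beta(1))]      # assumed starting value
    for j in range(2, J + 1):
        r = beta(j) / alpha(j)
        probs.append(r * probs[-1] / (1 + r * probs[-1]))
    return probs

def probs_closed_form(alpha, beta, J):
    """rho_j / s_j with rho_j = prod_{k<=j} r_k and s_j = 1 + sum_{k<=j} rho_k."""
    out, rho, s = [], Fr(1), Fr(1)
    for j in range(1, J + 1):
        rho *= beta(j) / alpha(j)
        s += rho
        out.append(rho / s)
    return out

def criterion_partial_sum(alpha, beta, J):
    """Partial sum over j = 1..J of prod_{k=1}^{j} beta_k / alpha_k."""
    total, rho = Fr(0), Fr(1)
    for j in range(1, J + 1):
        rho *= beta(j) / alpha(j)
        total += rho
    return total

def up(j):      # hypothetical upward rate alpha_j = 1/2 (drift away from zero)
    return Fr(1, 2)

def down(j):    # hypothetical downward rate beta_j = 1/4
    return Fr(1, 4)

assert probs_by_recursion(up, down, 20) == probs_closed_form(up, down, 20)
# r_j = 1/2: the series converges (partial sums stay below 1), so transient.
assert criterion_partial_sum(up, down, 50) < 1
# Swapping the rates gives r_j = 2 and rapidly growing partial sums.
assert criterion_partial_sum(down, up, 50) > 10 ** 6
```

A finite partial sum can only suggest convergence or divergence; deciding
the infinite series still requires an argument about the rates.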
To discriminate between null and positive recurrence is easier. 
Problem 11. Use the condition that

$$\lambda(k) = \sum_j p(k \mid j)\, \lambda(j)$$

has no solutions such that $\sum_k |\lambda(k)| < \infty$ to find a
necessary and sufficient condition that a recurrent birth and death process
on $I^+$ be null-recurrent.
Example D. Branching processes. These processes are characterized as
follows: If at time $n$ there are $k$ individuals present, then the jth one
independently of the others gives birth to $Y_j$ offspring by time
$n + 1$, $j = 1, \ldots, k$, where $P(Y_j = l) = p_l$,
$l = 0, 1, \ldots$ The $\{Y_j = 0\}$ event corresponds to the death of the
jth individual leaving no offspring. The state space is $I^+$, and the
transition probabilities for $X_n$, the population size at time $n$, are

$$p(m \mid k) = P(Y_1 + \cdots + Y_k = m),$$

where the $Y_1, \ldots, Y_k$ are independent and have the same distribution
as $Y_1$. Zero is an absorbing state (unless the model is revised to allow
the introduction of new individuals into the population). If $p_0 > 0$,
then the same argument as for birth and death processes establishes the
fact that every state except zero is transient. If $p_0 = 0$, then
obviously the same result holds. For a complete and interesting treatment
of these chains and their generalizations, see Harris [67].
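
The transition probabilities $p(m \mid k)$ are k-fold convolutions of the
offspring distribution, which is easy to sketch when only finitely many
offspring counts are possible. The offspring distribution below and the
helper names are hypothetical. The last check, that the mean of
$p(\cdot \mid k)$ is $k$ times the offspring mean, is the fact behind the
normalization $X_n/m^n$ used in the problem that follows.

```python
from fractions import Fraction as Fr

def convolve(a, b):
    """Distribution of the sum of two independent variables with pmfs a and b."""
    out = [Fr(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def transition_row(offspring, k):
    """p(. | k): pmf of Y_1 + ... + Y_k; for k = 0 a point mass at zero."""
    row = [Fr(1)]
    for _ in range(k):
        row = convolve(row, offspring)
    return row

def mean(pmf):
    return sum(m * p for m, p in enumerate(pmf))

offspring = [Fr(1, 4), Fr(1, 4), Fr(1, 2)]    # hypothetical p_0, p_1, p_2

assert transition_row(offspring, 0) == [Fr(1)]          # zero is absorbing
assert sum(transition_row(offspring, 3)) == 1
assert mean(transition_row(offspring, 3)) == 3 * mean(offspring)
```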
Problem 12. In a branching process, suppose $EY_1 = m < \infty$. Use the
martingale convergence theorem to show that $X_n/m^n$ converges a.s.
Example E. The Ehrenfest urn scheme. Following the work of Gibbs and 
Boltzmann statistical mechanics was faced with this paradox. For a system 
of particles in a closed container, referring to the 6N position-velocity vector 
as the state of the system, then in the ergodic case every state is recurrent in 
the sense that the system returns infinitely often to every neighborhood of 
any initial state. 

On the other hand, the observed macroscopic behavior is that a system 
seems to move irreversibly toward an equilibrium condition. Smoluchowski 
proposed the solution that states far removed from equilibrium have an 
enormously large recurrence time, thus the system over any reasonable 
observation time appears to move toward equilibrium. To illustrate this the 
Ehrenfests constructed a model as follows: consider two urns I and II, and a 
total of 2N molecules distributed within the two urns. At time n, a molecule 
is chosen at random from among the 2N and is transferred from whatever 
urn it happens to be in to the other urn. Let the state k of the chain be the 
number of molecules in urn I, k = 0, ..., 2N. The transition probabilities 
are given by 

$$p(k+1 \mid k) = \frac{2N - k}{2N}, \qquad p(k-1 \mid k) = \frac{k}{2N}.$$

All states communicate, and since there are only a finite number, all are 
positive-recurrent. We can use the fact that the stationary distribution 
$\pi(k) = 1/E_k T^{(k)}$ to get the expected recurrence times. 
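
A small exact computation makes the size of these recurrence times concrete. The sketch below is an illustration under stated assumptions: the function names are mine, and the binomial law $\binom{2N}{k} 2^{-2N}$ is taken as a candidate stationary distribution and checked against the stationarity equation rather than derived.

```python
from fractions import Fraction
from math import comb

def ehrenfest_transition(N):
    """P[j][k] = p(k | j) for the urn chain on k = 0, ..., 2N."""
    size = 2 * N + 1
    P = [[Fraction(0)] * size for _ in range(size)]
    for j in range(size):
        if j < 2 * N:
            P[j][j + 1] = Fraction(2 * N - j, 2 * N)
        if j > 0:
            P[j][j - 1] = Fraction(j, 2 * N)
    return P

def binomial_distribution(N):
    """Candidate stationary law pi(k) = C(2N, k) / 2^(2N)."""
    return [Fraction(comb(2 * N, k), 2 ** (2 * N)) for k in range(2 * N + 1)]

def is_stationary(pi, P):
    """Check pi(k) = sum_j p(k | j) pi(j) exactly."""
    n = len(pi)
    return all(pi[k] == sum(P[j][k] * pi[j] for j in range(n)) for k in range(n))

def expected_recurrence_times(pi):
    """E_k T^(k) = 1 / pi(k)."""
    return [1 / p for p in pi]
```

With only 2N = 20 molecules the expected recurrence time of the empty-urn state is already $2^{20}$ steps, while the balanced state k = N recurs after only a handful of steps. This is the effect Smoluchowski pointed to.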
Problem 13. Use the facts that 

$$\pi(k) = \sum_j p(k \mid j)\,\pi(j), \qquad \sum_k \pi(k) = 1$$

to show that 

a) $$\pi(N + k) = \frac{(2N)!}{(N + k)!\,(N - k)!}\,\frac{1}{2^{2N}}.$$

Compare this with the derivation of the central limit theorem for coin- 
tossing, Chapter 1, Section 3, and show that for N large, if T is the recurrence 
time for the states $\{k;\ |N - k| > x\sqrt{N/2}\}$, 

b) $$ET \simeq \left[\sqrt{\frac{2}{\pi}} \int_x^{\infty} e^{-t^2/2}\,dt\right]^{-1}.$$

See Kac [80] for further discussion. 



9. THE CONVERGENCE THEOREM 
The fundamental convergence result for Markov chains on the integers is 


Theorem 7.38. Let C be a closed indecomposable set of nonperiodic recurrent 
states. If the states are null-recurrent, then for all $j, k \in C$ 

$$p^{(n)}(j \mid k) \to 0.$$

If the states are positive-recurrent with stationary initial distribution $\pi(j)$, then 
for all $j, k \in C$ 

$$p^{(n)}(j \mid k) \to \pi(j).$$


There are many different proofs of this. Interestingly enough, the various 
proofs are very diverse in their origin and approach. One simple proof is 
based on 


Theorem 7.39 (The renewal theorem). For a nonperiodic renewal process, 

$$P(U_n = 1) \to \begin{cases} 1/ET_1, & \text{if } ET_1 < \infty, \\ 0, & \text{if } ET_1 = \infty. \end{cases}$$

There is a nice elementary proof of this in Feller [59, Volume I], and we prove 
a much generalized version in Chapter 10. 



The way this theorem is used in 7.38 is that for $\{U_n\}$ the return process 
for state j, $P(U_n = 1) = p^{(n)}(j \mid j)$; hence $p^{(n)}(j \mid j) \to \pi(j)$ if $E_j T^{(j)} < \infty$, or 
$p^{(n)}(j \mid j) \to 0$ if j is null-recurrent. No matter where the process is started, let 
$T^{(j)}$ be the first time that j is entered. Then 

$$(7.40) \qquad p^{(n)}(j \mid k) = \sum_{m=1}^{n} p^{(n-m)}(j \mid j)\,P_k(T^{(j)} = m)$$

by the Markov property. Argue that $P_k(T^{(j)} < \infty) = 1$ (see the proof of 
7.30). Now use the bounded convergence theorem to establish 

$$\lim_n p^{(n)}(j \mid k) = \lim_n p^{(n)}(j \mid j) \sum_{m=1}^{\infty} P_k(T^{(j)} = m).$$
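
As a numerical illustration of Theorem 7.38 (a sketch only, not part of the proof), one can raise a small transition matrix to a high power and watch every row approach the stationary distribution. The two-state matrix below is a made-up example; its stationary law (5/6, 1/6) solves $\pi(k) = \sum_j p(k \mid j)\pi(j)$ by direct substitution.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def n_step(P, n):
    """n-step transition matrix; entry [k][j] is p^(n)(j | k)."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

# Hypothetical nonperiodic chain, P[k][j] = p(j | k).
P = [[0.9, 0.1],
     [0.5, 0.5]]
PI = [5 / 6, 1 / 6]

if __name__ == "__main__":
    for n in (1, 5, 20, 50):
        print(n, n_step(P, n))
```

Both rows of the 50-step matrix agree with `PI` to many decimal places, whatever the starting state k.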
We can also use some heavier machinery to get a stronger result due to 
Orey [114]. We give this proof, from Blackwell and Freedman [9], because 
it involves an interesting application of martingales and the Hewitt-Savage 
zero-one law. 


Theorem 7.41. Let C be a closed indecomposable set of nonperiodic recurrent 
states. Then for any states k, l in C 

$$\sum_j |p^{(n)}(j \mid k) - p^{(n)}(j \mid l)| \to 0.$$

Remark. We can get 7.38 in the positive-recurrent case from 7.41 by noting 
that 7.41 implies 

$$\sum_l |p^{(n)}(j \mid k) - p^{(n)}(j \mid l)|\,\pi(l) \to 0.$$


In the null-recurrent case we use an additional fact: First, consider the 
event $A_m$ that starting from j the last entry into j up to time n was at time 
n − m, $A_m = \{X_{n-m} = j, X_{n-m+1} \ne j, \ldots, X_n \ne j\}$. The $A_m$ are disjoint 
and the union is the whole space. Furthermore, for the process starting 
from j, 

$$P_j(A_m) = P_j(X_{n-m} = j)\,P(X_{n-m+1} \ne j, \ldots, X_n \ne j \mid X_{n-m} = j) = p^{(n-m)}(j \mid j)\,P_j(T^{(j)} > m).$$

Consequently, 

$$(7.42) \qquad 1 = \sum_{m=0}^{n} p^{(n-m)}(j \mid j)\,P_j(T^{(j)} > m),$$

(where $p^{(0)}(j \mid j) = 1$). 
Let $\limsup_n p^{(n)}(j \mid j) = p$. Take a subsequence n' such that $p^{(n')}(j \mid j) \to p$. 
By 7.41, for any other state k, $p^{(n')}(j \mid k) \to p$. Use this to get 

$$p^{(n'+1)}(j \mid j) = \sum_k p^{(n')}(j \mid k)\,p(k \mid j) \to p.$$

Then for any r > 0 fixed, $p^{(n'+r)}(j \mid j) \to p$. Substitute n = n' + r in (7.42), 
and chop off some terms to get 

$$1 \ge \sum_{m=0}^{r} p^{(n'+r-m)}(j \mid j)\,P_j(T^{(j)} > m) \to p \sum_{m=0}^{r} P_j(T^{(j)} > m).$$

Noting that 

$$\sum_{m=0}^{\infty} P_j(T^{(j)} > m) = E_j T^{(j)} = \infty$$

implies p = 0, and using 7.41 we can complete the proof that $p^{(n)}(j \mid k) \to 0$. 
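
The last-entrance decomposition (7.42) holds for every n and every chain, so it can be checked numerically. The sketch below does this for a made-up three-state matrix; the helper names are my own. It computes the n-step return probabilities and the no-return probabilities $P_j(T^{(j)} > m)$ separately and then forms the sum on the right of (7.42).

```python
def return_probs(P, j, n):
    """[p^(m)(j | j) for m = 0, ..., n]."""
    size = len(P)
    v = [1.0 if s == j else 0.0 for s in range(size)]
    out = [1.0]
    for _ in range(n):
        v = [sum(v[s] * P[s][t] for s in range(size)) for t in range(size)]
        out.append(v[j])
    return out

def no_return_probs(P, j, n):
    """[P_j(T > m) for m = 0, ..., n], T the first return time to j."""
    size = len(P)
    v = [1.0 if s == j else 0.0 for s in range(size)]  # mass that has avoided j since time 0
    out = [1.0]
    for _ in range(n):
        # move one step, discarding any mass that lands on j
        v = [sum(v[s] * P[s][t] for s in range(size)) if t != j else 0.0
             for t in range(size)]
        out.append(sum(v))
    return out

def last_entrance_sum(P, j, n):
    """Right side of (7.42): sum over m of p^(n-m)(j | j) P_j(T > m)."""
    u = return_probs(P, j, n)
    q = no_return_probs(P, j, n)
    return sum(u[n - m] * q[m] for m in range(n + 1))

# Hypothetical chain, P3[s][t] = p(t | s); each row sums to one.
P3 = [[0.5, 0.3, 0.2],
      [0.1, 0.6, 0.3],
      [0.4, 0.4, 0.2]]
```

For this matrix `last_entrance_sum(P3, 0, n)` equals 1 up to rounding for every n, as (7.42) requires.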
Proof of 7.41. Take $\pi$ any initial distribution such that $\pi(l) > 0$. Let $\mathcal{T}$ 
be the tail σ-field of the process $X_0, X_1, \ldots$ and suppose that $\mathcal{T}$ has the zero- 
one property under $P_\pi$. Then an easy consequence of the martingale convergence 
theorem (see Problem 6, Chapter 5) is that for any $A \in \mathcal{F}(\mathbf{X})$, 

$$\sup_{B \in \mathcal{F}(X_n, \ldots)} |P_\pi(A \cap B) - P_\pi(A)P_\pi(B)| \to 0.$$

Let $D \subset C$, $B = \{X_n \in D\}$, $A = \{X_0 = l\}$, so that 

$$\sup_{D \subset C} |P_\pi(X_n \in D, X_0 = l) - P_\pi(X_n \in D)\,\pi(l)| \to 0$$

or 

$$\sup_{D \subset C} |P_l(X_n \in D) - P_\pi(X_n \in D)| \to 0.$$

Write $C_n^+ = \{j;\ P_l(X_n = j) > P_\pi(X_n = j)\}$, $C_n^-$ the complement of $C_n^+$ 
in C. Then by the above, 

$$|P_l(X_n \in C_n^+) - P_\pi(X_n \in C_n^+)| \to 0, \qquad |P_l(X_n \in C_n^-) - P_\pi(X_n \in C_n^-)| \to 0,$$

implying 

$$\sum_j |P_l(X_n = j) - P_\pi(X_n = j)| \to 0.$$

Now use the initial distribution $\pi$ which assigns mass ½ each to the states l 
and k to get the stated result. 
The completed proof is provided by 


Theorem 7.43. For any tail event A, either $P_j(A)$ is one for all $j \in C$, or zero 
for all $j \in C$ (under the conditions of 7.41). 


Proof. Consider the process starting from j. The random vectors $Z_k = 
(X_{R_k}, \ldots, X_{R_{k+1}-1})$ are independent and identically distributed by 7.28. 
Clearly, $\mathcal{F}(\mathbf{X}) = \mathcal{F}(Z_0, Z_1, \ldots)$. Take W a tail random variable; that is, 
W is measurable $\mathcal{T}$. 

For every n, there is a random variable $\varphi_n(\mathbf{x})$ on $(R^{(\infty)}, \mathcal{B}_\infty)$ such that 
$W = \varphi_n(X_n, \ldots)$. So for every k, 

$$W = \varphi_{R_k}(X_{R_k}, X_{R_k+1}, \ldots) = \theta(R_k, Z_k, Z_{k+1}, \ldots).$$

Now $R_k$ is a symmetric function of $Z_0, \ldots, Z_{k-1}$. Hence W is a symmetric 
function of $Z_0, Z_1, \ldots$ The Hewitt-Savage zero-one law holds word for 
word for independent identically distributed random vectors, instead of 
random variables. Therefore W is a.s. constant, and $P_j(A)$ is zero or one for 
every tail event A. 

For any two states j, k let l be a descendant, $p^{(n)}(l \mid j) > 0$, $p^{(n)}(l \mid k) > 0$. 
Write 

$$P_j(A) = E_j(P(A \mid X_n, \ldots, X_0)).$$

Since A is measurable $\mathcal{F}(X_{n+1}, \ldots)$, 

$$P_j(A) = E_j(P(A \mid X_n)) = \sum_i P(A \mid X_n = i)\,p^{(n)}(i \mid j).$$

If $P_j(A) = 1$, then $P(A \mid X_n = l) \ne 0$. But using k instead of j above, we 
get 

$$P_k(A) \ge P(A \mid X_n = l)\,p^{(n)}(l \mid k) > 0.$$

Hence $P_k(A) = 1$, and the theorem is proved. 


From this fundamental theorem follows a complete description of the 
asymptotic behavior of the $p^{(n)}(j \mid k)$. If a closed communicating set of 
positive recurrent states has period d, then any one of the cyclically moving 
subclasses $D_r$, r = 1, ..., d is nonperiodic and closed under the transition 
probability $p^{(d)}(j \mid k)$. Looking at this class at time steps d units apart, 
conclude that 

$$p^{(nd)}(j \mid k) \to \frac{d}{E_j T^{(j)}}, \qquad j, k \in D_r.$$

If $k \in D_r$, $j \in D_{r+l}$, use (7.40) to get 

$$p^{(nd+l)}(j \mid k) \to \frac{d}{E_j T^{(j)}}.$$

If both transient states and positive-recurrent states are present then the 
asymptotic behavior of $p^{(n)}(j \mid k)$, j positive-recurrent and nonperiodic, k 
transient, depends on the probability $P(C \mid k)$ that starting from k the 
process will eventually enter the class of states C communicating with j. 
From (7.40), in fact, 

$$p^{(n)}(j \mid k) \to \frac{P(C \mid k)}{E_j T^{(j)}}.$$

When j is periodic, the behavior depends not only on $P(C \mid k)$ but also at 
what point in the cycle of motion in C the particle from k enters C. 



10. THE BACKWARD METHOD 


There is a simple device which turns out to be important, both theoretically 
and practically, in the study of Markov chains. Let $Z = \varphi(X_0, X_1, \ldots)$ be 
any random variable on a Markov chain with stationary transition 
probabilities. Then the device is to get an equation for $f(x) = E(Z \mid X_0 = x)$ 
by using the fact that 

$$E(E(Z \mid X_1 = y, X_0 = x) \mid X_0 = x) = E(Z \mid X_0 = x).$$

Of course, this will be useful only if $E(Z \mid X_1 = y, X_0 = x)$ can be expressed 
in terms of f. The reason I call this the backward method is that it is the 
initial conditions of the process that are perturbed. Here are some examples. 
Mostly for convenience, look at the countable cases. It is not difficult to 
see how the same method carries over to similar examples in the general case. 


a) Invariant random variable. Let Z be a bounded invariant function, 
$Z = \varphi(X_n, X_{n+1}, \ldots)$, $n \ge 0$. Then if $E(Z \mid X_0 = j) = f(j)$, 

$$E(Z \mid X_1 = k, X_0 = j) = E(\varphi(X_1, \ldots) \mid X_1 = k) = f(k),$$

so that f(j) is a bounded solution of 

$$(7.44) \qquad f(j) = \sum_k f(k)\,p(k \mid j).$$

There is an interesting converse. Let f(j) be a bounded solution of (7.44). 
Then write (7.44) as 

$$f(X_n) = E(f(X_{n+1}) \mid X_n).$$

By the Markov property, 

$$f(X_n) = E(f(X_{n+1}) \mid X_n, \ldots, X_0).$$

This says that $f(X_n)$ is a martingale. Since it is bounded, the convergence 
theorem applies, and there is a random variable Y such that 

$$f(X_n) \to Y \quad \text{a.s.}$$

If $Y = \theta(X_0, X_1, \ldots)$, from 

$$f(X_{n+1}) \to Y \quad \text{a.s.}$$

conclude that $Y = \theta(X_1, X_2, \ldots)$ a.s. Thus Y is a.s. invariant, and 

$$f(j) = E(Y \mid X_0 = j).$$

(Use here an initial distribution which assigns positive mass to all states.) 
Formally, 


Proposition 7.45. Let $\pi(j) > 0$ for all $j \in F$, then there is a one-to-one 
correspondence between bounded a.s. invariant random variables and bounded 
solutions of (7.44). 


b) Absorption probabilities. Let $C_1, C_2, \ldots$ be closed sets of communicating 
recurrent states. Let A be the event that a particle starting from state k is 
eventually absorbed in $C_1$, 

$$A = \{X_n \in C_1, \text{ all } n \text{ sufficiently large}\}.$$

A is an invariant event, so $f(j) = P(A \mid X_0 = j)$ satisfies (7.44). There are 
also the boundary conditions: 

$$f(j) = \begin{cases} 1, & j \in C_1, \\ 0, & j \in C_h,\ h \ne 1. \end{cases}$$

If one solves (7.44) subject to these boundary conditions and boundedness, is 
the solution unique? No, in general, because if J is the set of all transient 
states, the event that the particle remains in J for all time is invariant, and 

$$c(j) = P_j(X_n \in J,\ n = 0, 1, \ldots)$$

is zero on all $C_h$, and any multiple of c(j) may be added to any given solution 
satisfying the boundary conditions to give another solution. If the 
probability of remaining in transient states for all time is zero, then the solution 
is unique. For example, let g be such a solution, and start the process from 
state j. The process $g(X_n)$, n = 0, 1, ... is a martingale. Let n* be the first 
time that one of the $C_h$ is entered. This means that n* is a stopping time 
for the $X_0, X_1, \ldots$ sequence. Furthermore $g(X_n)$ and n* satisfy the 
hypothesis of the optional stopping theorem. Therefore, 

$$E g(X_{n^*}) = E g(X_0) = g(j).$$

But 

$$g(X_{n^*}) = \begin{cases} 1, & X_{n^*} \in C_1, \\ 0, & X_{n^*} \notin C_1. \end{cases}$$

Therefore $g(j) = P(A \mid X_0 = j)$. 

In fair coin-tossing, with initial fortune zero, what is the probability 
that we win $M_1$ dollars before losing $M_2$? This is the same problem as: 
For a simple symmetric random walk starting from zero, with absorbing 
states at $M_1$, $-M_2$, find the probability of being absorbed into $M_1$. Let 
$p^+(j)$ be the probability of being absorbed into $M_1$ starting from j, $-M_2 < 
j < M_1$. Then $p^+(j)$ must satisfy (7.44) which in this case is 

$$p^+(j) = \tfrac{1}{2} p^+(j+1) + \tfrac{1}{2} p^+(j-1), \qquad -M_2 < j < M_1,$$

and the boundary conditions $p^+(-M_2) = 0$, $p^+(M_1) = 1$. This solution is 
easy to get: 

$$p^+(j) = \frac{j + M_2}{M_1 + M_2}.$$
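
The backward equation with its boundary conditions is a finite linear system, so the closed form can be checked by solving the system directly. The following is a sketch under my own naming; the exact-fraction Gauss-Jordan routine is a generic helper, not something taken from the text.

```python
from fractions import Fraction

def solve(A, b):
    """Gauss-Jordan elimination over exact fractions; A is assumed nonsingular."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def win_probabilities(M1, M2):
    """Solve p(j) = p(j+1)/2 + p(j-1)/2 with p(-M2) = 0 and p(M1) = 1."""
    states = list(range(-M2, M1 + 1))
    index = {j: i for i, j in enumerate(states)}
    n = len(states)
    A = [[Fraction(0)] * n for _ in range(n)]
    b = [Fraction(0)] * n
    for j in states:
        i = index[j]
        A[i][i] = Fraction(1)
        if j == M1:
            b[i] = Fraction(1)            # boundary condition at the winning barrier
        elif j != -M2:
            A[i][index[j + 1]] -= Fraction(1, 2)
            A[i][index[j - 1]] -= Fraction(1, 2)
    return dict(zip(states, solve(A, b)))
```

For example, `win_probabilities(3, 2)[0]` returns 2/5, in agreement with $(j + M_2)/(M_1 + M_2)$ at j = 0.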
c) Two other examples. Among many others, not involving invariant sets, 
I pick two. Let n* be the time until absorption into the class C of recurrent 
states, assuming $C \ne \emptyset$. Write $m(j) = E_j(n^*)$. Check that 

$$E(n^* \mid X_1 = k, X_0 = j) = \begin{cases} 1, & k \in C,\ j \notin C, \\ 1 + m(k), & k \notin C,\ j \notin C, \end{cases}$$

and apply the backward argument to give 

$$(7.46) \qquad m(j) = \sum_k m(k)\,p(k \mid j) + 1, \qquad j \notin C.$$

The boundary conditions are m(j) = 0, $j \in C$. 
Now let $N_i$ be the number of visits to state i before absorption into C. 
Denote $G(j, i) = E_j(N_i)$. For $k \notin C$, $j \notin C$, 

$$E_j(N_i \mid X_1 = k) = \begin{cases} G(k, i), & j \ne i, \\ G(k, i) + 1, & j = i. \end{cases}$$

So 

$$(7.47) \qquad G(j, i) = \delta(j, i) + \sum_k G(k, i)\,p(k \mid j),$$

where $\delta(j, i) = 0$ or 1 as $j \ne i$ or $j = i$. The boundary conditions are 
G(j, i) = 0, $j \in C$. Of course, this makes no sense unless i is transient. 
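
For the simple symmetric random walk absorbed at $M_1$ and $-M_2$, both (7.46) and (7.47) can be verified by substitution. The sketch below assumes the candidate closed forms $m(j) = (M_1 - j)(M_2 + j)$ and $G(j, i) = 2(M_1 - \max(i, j))(M_2 + \min(i, j))/(M_1 + M_2)$, which I state here as assumptions to be checked, and confirms in exact arithmetic that they satisfy the equations and boundary conditions. The function names are illustrative.

```python
from fractions import Fraction

def m(j, M1, M2):
    """Candidate E_j(n*) for the walk absorbed at M1 and -M2."""
    return Fraction((M1 - j) * (M2 + j))

def G(j, i, M1, M2):
    """Candidate E_j(N_i): expected visits to i, counting time 0 when j == i."""
    lo, hi = min(i, j), max(i, j)
    return Fraction(2 * (M1 - hi) * (M2 + lo), M1 + M2)

def satisfies_7_46(M1, M2):
    interior = range(-M2 + 1, M1)
    return (m(M1, M1, M2) == 0 and m(-M2, M1, M2) == 0 and
            all(m(j, M1, M2) == (m(j + 1, M1, M2) + m(j - 1, M1, M2)) / 2 + 1
                for j in interior))

def satisfies_7_47(M1, M2):
    interior = range(-M2 + 1, M1)
    return all(
        G(j, i, M1, M2) == (1 if i == j else 0)
        + (G(j + 1, i, M1, M2) + G(j - 1, i, M1, M2)) / 2
        for i in interior for j in interior)
```

Summing G(j, i) over the interior states i recovers m(j), as it must, since the time to absorption is the total number of visits to transient states.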

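With these last two examples there is a more difficult uniqueness problem. 
For example, in (7.46) assume that 

$$E_j(n^*) < \infty, \qquad \text{all } j.$$

Then any nonnegative solution g(j) of (7.46) satisfying 

$$\lim_N \int_{\{n^* > N\}} g(X_N)\,dP_j = 0$$

must be $E_j(n^*)$. To prove this, check that 

$$g(X_n) + \sum_{k=0}^{n-1} \chi_{C^c}(X_k)$$

is a martingale sequence, where $\chi_{C^c}$ is the indicator of the complement of C, 
that $E_j\big(\sum_{k=0}^{n^*-1} \chi_{C^c}(X_k)\big) = E_j(n^*)$, and apply optional 
stopping. 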


Problems 


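14. For simple symmetric random walk with absorbing states at $M_1$, $-M_2$, 
show that 

a) $E_j(n^*) = (M_2 + j)(M_1 - j)$, 

b) $$E_j(N_i) = \begin{cases} \dfrac{2(M_1 - j)(M_2 + i)}{M_1 + M_2}, & j \ge i, \\[1ex] \dfrac{2(M_1 - i)(M_2 + j)}{M_1 + M_2}, & j < i. \end{cases}$$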


15. Let $\{X_n\}$ be simple symmetric random walk. Derive the expressions for 
$p^+(j)$ and $E_j(n^*)$ by showing that the sequences $\{X_n\}$, $\{X_n^2 - n\}$ are 
martingales and applying the stopping time results of Section 7, Chapter 5. 

16. For simple symmetric random walk with absorbing states at $M_1$, $-M_2$, 
use the expression for $p^+(j)$ to evaluate $q(j) = P_j$(at least one return to j). 
For $-M_2 < j < i < M_1$, $E_j(N_i)$ is the probability that the particle hits i 
before $-M_2$ times the expected number of returns to i starting from i before 
absorption. Use $p^+(j)$, for absorbing states at $-M_2$, i, and q(i) to evaluate 
$E_j(N_i)$. 

17. For any given set D of states, let A be the event that $X_n$ stays in D for all 
n, 

$$A = \{X_n \in D,\ n = 0, 1, \ldots\}.$$

a) Show that $f(j) = P(A \mid j)$ satisfies 

$$f(j) = \sum_{k \in D} f(k)\,p(k \mid j), \quad j \in D; \qquad f(j) = 0, \quad j \in D^c.$$

b) Prove using (a) that a state h is transient iff there exists a bounded 
nontrivial solution to the equation 

$$g(j) = \sum_{k \ne h} g(k)\,p(k \mid j), \qquad j \ne h.$$

c) Can you use (b) to deduce 7.37? 


NOTES 


In 1906 A. A. Markov [110] proved the existence of stationary initial 
distributions for Markov chains with a finite number of states. His method 
is simple and clever, and the idea can be generalized. A good exposition is 
in Doob’s book [38, pp. 170 ff]. The most fundamental work on general 
state spaces is due to W. Doeblin [25] in 1937 and [28] in 1940. Some of these 
latter results concerning the existence of invariant initial distributions are 
given in Doob’s book. The basic restriction needed is a sort of compactness 
assumption to keep the motion from being transient or null-recurrent. But 
a good deal of Doeblin’s basic work occurs before this restriction is imposed, 
and is concerned with the general decomposition of the state space. For 
an exposition of this, see K. L. Chung [15] or [17]. 

The difficulty in the general state space is that there is no way of classifying 
each state y by means of the process of returns to y. If, for example, p(A | x) 
assigns zero mass to every one-point set, then the probability of a return to 
x is zero. You might hope to get around this by considering returns to a 
neighborhood of x, but then the important independence properties of the 
recurrence times no longer hold. It may be possible to generalize by taking 
smaller and smaller neighborhoods and getting limits, but this program looks 
difficult and has not been carried out successfully. Hence, in the general case, 
it is not yet clear what definition is most appropriate to use in classifying 
chains as recurrent or transient. For a fairly natural definition of recurrent 
chains Harris [66] generalized Doeblin’s result by showing the existence of a 
possibly infinite, but always σ-finite measure Q(dx) satisfying 

$$Q(B) = \int p(B \mid x)\,Q(dx), \qquad B \in \mathcal{B}_1(F).$$

His idea was very similar to the idea in the countable case: Select a set 
$A \in \mathcal{B}_1(F)$ so that an initial distribution $\pi_A(\cdot)$ exists concentrated on A such 
that every time the process returned to A, it had the distribution $\pi_A$. This 
could be done using Doeblin’s technique. Then define $\pi(B)$, $B \in \mathcal{B}_1(F)$, as the 
expected number of visits to B between visits to A, using the initial distribution 
$\pi_A$. 

The basic work when the state space is countable but not necessarily finite 
is due to Kolmogorov [95], 1936. The systematic application of the renewal 
theorem and concepts was done by Feller, see [55]. K. L. Chung’s book [16] 
is an excellent source for a more complete treatment of the countable case. 

The literature concerning applications of Markov chains is enormous. 
Karlin’s book [86] has some nice examples; so does Feller’s text [59, Vol. I]. 
A. T. Bharucha-Reid’s book [3] is more comprehensive. 


The proof of Proposition 7.37 given in the first edition of this book was 
incorrect. I am indebted to P. J. Thomson and K. M. Wilkinson for pointing 
out the error and supplying a correction. 


CHAPTER 8 


CONVERGENCE IN DISTRIBUTION 
AND THE TOOLS THEREOF 


1. INTRODUCTION 


Back in Chapter 1, we noted that if $Z_n = Y_1 + \cdots + Y_n$, where the $Y_i$ are 
independent and ±1 with probability ½, then 

$$P\left(\frac{Z_n}{\sqrt{n}} < x\right) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt.$$

Thus the random variables $Z_n/\sqrt{n}$ have distribution functions $F_n(x)$ that 
converge for every value of x to $\Phi(x)$, but from Problem 23, Chapter 3, 
certainly the random variables $Z_n/\sqrt{n}$ do not converge a.s. (or for that matter 
in $L_1$, or in any strong sense). What are convergent here are not the values 
of the random variables themselves, but the probabilities with which the 
random variables assume certain values. In general, we would like to say that 
the distribution of the random variable $X_n$ converges to the distribution of 
X if $F_n(x) = P(X_n < x) \to F(x) = P(X < x)$ for every $x \in R^{(1)}$. But this is a 
bit too strong. For instance, suppose $X \equiv 0$. Then we would want the 
values of $X_n$ to be more and more concentrated about zero, that is for any 
$\varepsilon > 0$ we would want 

$$(8.1) \qquad P(X_n < -\varepsilon) \to 0, \qquad P(X_n > \varepsilon) \to 0.$$

Now F(0) = 0, but 8.1 could hold, even with $F_n(0) = 1$, for all n. Take 
$X_n = -1/n$, for example. What 8.1 says is that for all x < 0, $F_n(x) \to F(x)$, 
and for all x > 0, $F_n(x) \to F(x)$. Apparently, not much should be assumed 
about what happens for x a discontinuity point of F(x). Hence we state the 
following: 


Definition 8.2. We say that $X_n$ converges to X in distribution, $X_n \xrightarrow{D} X$, if 
$F_n(x) \to F(x)$ at every point $x \in C(F)$, the set of continuity points of F. That is, 
$P(X = x) = 0 \Rightarrow F_n(x) \to F(x)$. We will also write in this case $F_n \xrightarrow{D} F$. 
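
The reason for restricting to continuity points can be seen in a few lines of code. This sketch uses the convention F(x) = P(X < x) for point masses; the helper name is my own. It shows that for $X_n = -1/n$ and X = 0 the distribution functions agree in the limit everywhere except at the single discontinuity point x = 0.

```python
def point_mass_df(c):
    """Distribution function F(x) = P(X < x) of the constant random variable X = c."""
    return lambda x: 1.0 if c < x else 0.0

F = point_mass_df(0.0)                  # limit: X = 0

def F_n(n):
    return point_mass_df(-1.0 / n)      # X_n = -1/n

if __name__ == "__main__":
    for x in (-0.5, 0.0, 0.3):
        print(x, [F_n(n)(x) for n in (1, 10, 1000)], F(x))
```

At x = 0 every $F_n$ equals 1 while F equals 0, yet $X_n \to X$ in any reasonable sense.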
Different terminology is sometimes used. 

Definition 8.3. By the law of X, written $\mathcal{L}(X)$, is meant the distribution 
of X. Convergence in distribution is also called convergence in law and $\mathcal{L}(X_n) \to 
\mathcal{L}(X)$ is equivalent notation for $X_n \xrightarrow{D} X$. If random variables X and Y have the 
same distribution, write either $\mathcal{L}(X) = \mathcal{L}(Y)$ or $X \overset{D}{=} Y$. 

Recall from Chapter 2 that a function F(x) on $R^{(1)}$ is the distribution 
function of a random variable iff 

(8.4)  i) $x < y \Rightarrow F(x) \le F(y)$, 

       ii) $x \uparrow y \Rightarrow F(x) \uparrow F(y)$, 

       iii) $F(-\infty) = 0$, $F(+\infty) = 1$. 
Problems 


1. Show that if F(x) = P(X < x), then F(x+) − F(x) = P(X = x). Show 
that $C^c(F)$ is at most countable ($F(x+) = \lim_{y \downarrow x} F(y)$). 

2. Let T be a dense set of points in $R^{(1)}$, $F_0(x)$ on T having properties (8.4 
i, ii, and iii) with $x, y \in T$. Show that there is a unique distribution function 
F(x) on $R^{(1)}$ such that $F(x) = F_0(x)$, $x \in T$. 

3. Show that if, for each n, $X_n$ takes values in the integers I, then $X_n \xrightarrow{D} X$ 
implies $P(X \in I) = 1$ and $X_n \xrightarrow{D} X \Leftrightarrow P(X_n = j) \to P(X = j)$, all $j \in I$. 

4. If $F_n$, F are distribution functions, $F_n \xrightarrow{D} F$, and F(x) continuous, show 
that 

$$\sup_x |F_n(x) - F(x)| \to 0.$$

5. If $X \overset{D}{=} Y$, and $\varphi(x)$ is $\mathcal{B}_1$-measurable, show that $\varphi(X) \overset{D}{=} \varphi(Y)$. Give an 
example to show that if X, Y, Z are random variables defined on the same 
probability space, that $X \overset{D}{=} Y$ does not necessarily imply that $XZ \overset{D}{=} YZ$. 
Define a random variable X to have a degenerate distribution if X is a.s. 
constant. 

6. Show that if $X_n \xrightarrow{D} X$ and X has a degenerate distribution, then $X_n \xrightarrow{P} X$. 

2. THE COMPACTNESS OF DISTRIBUTION FUNCTIONS 


One of the most frequently used tools in D-convergence is a certain compact- 
ness property of distribution functions. They themselves are not compact, 
but we can look at a slightly larger set of functions. 


Definition 8.5. Let $\mathcal{M}$ be the class of all functions G(x) satisfying (8.4 i and ii), 
with the addition of 

(iii′) $0 \le G(-\infty)$, $G(+\infty) \le 1$. 

As before, for $G_n, G \in \mathcal{M}$, $G_n \xrightarrow{D} G$ if $\lim G_n(x) = G(x)$ at all points of C(G). 


Theorem 8.6 (Helly-Bray). $\mathcal{M}$ is sequentially compact under $\xrightarrow{D}$. 

Proof. Let $G_n \in \mathcal{M}$, take $T = \{x_k\}$, k = 1, 2, ... dense in $R^{(1)}$. We apply 
Cantor’s diagonalization method. That is, let $I_1 = \{n_1, n_2, \ldots\}$ be an ordered 
subset of the integers such that $G_n(x_1)$ converges as $n \to \infty$ through $I_1$. Let 
$I_2 \subset I_1$ be such that $G_n(x_2)$ converges as $n \to \infty$ through $I_2$. Continue this 
way getting decreasing ordered subsets $I_1, I_2, \ldots$ of the integers. Let $n_m$ 
be the mth member of $I_m$. For $m \ge k$, $n_m \in I_k$, so for every $x_k \in T$, $G_{n_m}(x_k)$ 
converges. Define $G_0(x)$ on T by $G_0(x_k) = \lim_m G_{n_m}(x_k)$. Define G(x) on 
$R^{(1)}$ by 

$$G(x) = \lim_{x_k \uparrow x} G_0(x_k).$$

It is easy to check that $G \in \mathcal{M}$. Let $x \in C(G)$. Then 

$$\lim_{x_k \uparrow x} G_0(x_k) = G(x),$$

by definition, but also check that 

$$\lim_{x_k \downarrow x} G_0(x_k) = G(x).$$

Take $x_k' < x < x_k''$, $x_k', x_k'' \in T$. Then 

$$G_{n_m}(x_k') \le G_{n_m}(x) \le G_{n_m}(x_k''),$$

implying 

$$G_0(x_k') \le \liminf_m G_{n_m}(x) \le \limsup_m G_{n_m}(x) \le G_0(x_k'').$$

Letting $x_k' \uparrow x$, $x_k'' \downarrow x$ gives the result that $G_{n_m}$ converges to G at every $x \in C(G)$. 
A useful way of looking at the Helly-Bray theorem is 


Corollary 8.7. Let $G_n \in \mathcal{M}$. If there is a $G \in \mathcal{M}$ such that for every D-convergent 
subsequence $G_{n_m}$, $G_{n_m} \xrightarrow{D} G$, then the full sequence $G_n \xrightarrow{D} G$. 


Proof. If $G_n$ does not D-converge to G, there exists an $x_0 \in C(G)$ such that $G_n(x_0) \not\to G(x_0)$. 
But every subsequence of the $G_n$ contains a convergent subsequence $G_{n_m}$, and 
$G_{n_m}(x_0) \to G(x_0)$. 


[Fig. 8.1: graph of $F_n(x)$.] 


Unfortunately, the class of distribution functions itself is not compact. 
For instance, take $X_n = n$ (see Fig. 8.1). Obviously $\lim_n F_n(x) = 0$ identically. 
The difficulty here is that mass floats out to infinity, disappearing in the 
limit. We want to use the Helly-Bray theorem to get some compactness 
properties for distribution functions. But to do this we are going to have to 
impose additional restrictions to keep the mass from moving out to infinity. 
We take some liberties with the notation. 

Definition 8.8. F(B), $B \in \mathcal{B}_1$, will denote the extension of F(x) to a probability 
measure on $\mathcal{B}_1$, that is, if F(x) = P(X < x), $F(B) = P(X \in B)$. 


Definition 8.9. Let $\mathcal{N}$ denote the set of all distribution functions. A subset 
$\mathcal{E} \subset \mathcal{N}$ will be said to be mass-preserving if for any $\varepsilon > 0$, there is a finite 
interval I such that $F(I^c) < \varepsilon$, all $F \in \mathcal{E}$. 
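
To make Definition 8.9 concrete, here is a small sketch (my own example, not the text's) using centered normal distributions, for which the mass outside I = [−L, L) is available in closed form through the complementary error function. A family with bounded scale is mass-preserving; a family whose scales grow without bound is not, because for any fixed L some member keeps almost all of its mass outside I.

```python
from math import erfc, sqrt

def tail_mass(scale, L):
    """F(I^c) for F the N(0, scale^2) distribution and I = [-L, L)."""
    return erfc(L / (scale * sqrt(2.0)))

def worst_tail(scales, L):
    """Largest F(I^c) over the family indexed by the given scales."""
    return max(tail_mass(s, L) for s in scales)

if __name__ == "__main__":
    bounded = [0.5, 1.0, 2.0]
    growing = range(1, 10 ** 4)
    for L in (1.0, 10.0, 100.0):
        print(L, worst_tail(bounded, L), worst_tail(growing, L))
```

For the bounded family the worst tail mass can be made as small as desired by enlarging L; for the growing family it stays near one for each L shown.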

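Proposition 8.10. Let $\mathcal{E} \subset \mathcal{N}$. Then $\mathcal{E}$ is conditionally compact in $\mathcal{N}$ if and 
only if $\mathcal{E}$ is mass-preserving (that is, $F_n \in \mathcal{E} \Rightarrow \exists\, F_{n_m}$ such that $F_{n_m} \xrightarrow{D} F \in \mathcal{N}$). 


Proof. Assume $\mathcal{E}$ mass-preserving, $F_n \in \mathcal{E}$. There exists $G \in \mathcal{M}$ such that 
$F_{n_m} \xrightarrow{D} G$. For any $\varepsilon$, take a, b such that $F_n([a, b)) > 1 - \varepsilon$, all n. Take a′ < a, 
b′ > b so that $a', b' \in C(G)$. Then 

$$F_{n_m}([a', b')) = F_{n_m}(b') - F_{n_m}(a') \to G(b') - G(a'),$$

with the conclusion that $G(b') - G(a') \ge 1 - \varepsilon$, or $G(+\infty) = 1$, $G(-\infty) = 0$, 
hence $G \in \mathcal{N}$. On the other hand, let $\mathcal{E}$ be conditionally compact in $\mathcal{N}$. 
If $\mathcal{E}$ is not mass-preserving, then there is an $\varepsilon > 0$ such that for every finite 
I, 

$$\inf_{F \in \mathcal{E}} F(I) \le 1 - \varepsilon.$$

Take $F_n \in \mathcal{E}$ such that for every n, $F_n([-n, +n)) \le 1 - \varepsilon$. Now take a 
subsequence $F_{n_m} \xrightarrow{D} F \in \mathcal{N}$. Let $a, b \in C(F)$; then $F_{n_m}([a, b)) \to F([a, b))$, 
but for $n_m$ sufficiently large $[a, b) \subset [-n_m, +n_m)$. Thus $F([a, b)) \le 1 - \varepsilon$ 
for any $a, b \in C(F)$ which implies $F \notin \mathcal{N}$. 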


One obvious corollary of 8.10 is 


Corollary 8.11. If $F_n \xrightarrow{D} F$, $F \in \mathcal{N}$, then $\{F_n\}$ is mass-preserving. 


Problems 


7. For $-\infty < a < b < +\infty$, consider the class of all distribution functions 
such that F(a) = 0, F(b) = 1. Show that this class is sequentially compact. 


8. Let $F_n \xrightarrow{D} F$, and $F_n$, F be distribution functions. Show that for B, 
any Borel set, it is not necessarily true that $F_n(B) = 1$, for all n $\Rightarrow$ F(B) = 1. 
Show that if B is closed, however, then $F_n(B) = 1$, for all n $\Rightarrow$ F(B) = 1. 

9. Let g(x) be $\mathcal{B}_1$-measurable, such that $|g(x)| \to \infty$ as $x \to \pm\infty$. If 
$\mathcal{E} \subset \mathcal{N}$ is such that $\sup_{F \in \mathcal{E}} \int |g|\,dF < \infty$, then $\mathcal{E}$ is mass-preserving. 


10. Show that if there is an r > 0 such that $\limsup E|X_n|^r < \infty$, then $\{F_n\}$ 
is mass-preserving. 


11. The support of F is the smallest closed set C such that F(C) = 1. 
Show that such a minimal closed set exists. A point of increase of F is 
defined as a point x such that for every neighborhood N of x, F(N) > 0. 
Show that the set of all points of increase is exactly the support of F. 

12. Define a Markov chain with stationary transition probabilities $p(\cdot \mid x)$ to 
be stable if for any sequence of initial distributions $\pi_n$ D-converging to an 
initial distribution $\pi$, the probabilities $\int p(\cdot \mid x)\,\pi_n(dx)$ D-converge to the 
probability $\int p(\cdot \mid x)\,\pi(dx)$. 

If the state space of a stable Markov chain is a compact interval, show that 
there is at least one invariant initial distribution. [Use Problem 7 applied to 
the probabilities $\frac{1}{n}\sum_{m=1}^{n} p^{(m)}(\cdot \mid x)$ for x fixed.] 


3. INTEGRALS AND D-CONVERGENCE 


Suppose $F_n \xrightarrow{D} F$, $F_n, F \in \mathcal{N}$, does it then follow that for any reasonable 
measurable function f(x), that $\int f(x)\,dF_n \to \int f(x)\,dF$? The answer is 
No! For example, let 

$$X_n = 1/n, \quad X = 0, \quad \text{so } F_n \xrightarrow{D} F.$$

Now take f(x) = 0, $x \le 0$, and f(x) = 1, x > 0. Then $\int f\,dF_n = 1$, but 
$\int f\,dF = 0$. But it is easy to see that it works for f bounded and continuous. 
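
The counterexample is simple enough to run. In the sketch below, the integral of f against the distribution of a constant c is just f(c); the names are illustrative, and `math.cos` stands in for an arbitrary bounded continuous function.

```python
import math

def integral_against_point_mass(f, c):
    """Integral of f with respect to the distribution of the constant X = c."""
    return f(c)

def step(x):
    """f(x) = 0 for x <= 0 and 1 for x > 0: discontinuous exactly where F has its mass."""
    return 1.0 if x > 0 else 0.0

smooth = math.cos   # bounded and continuous

ns = [1, 10, 100, 1000]
step_values = [integral_against_point_mass(step, 1.0 / n) for n in ns]
smooth_values = [integral_against_point_mass(smooth, 1.0 / n) for n in ns]
```

The step-function integrals stay at 1 for every n while the limit integral is 0; the cosine integrals approach cos 0 = 1, the integral against F.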
Actually, a little more can be said. 


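Proposition 8.12. Let $F_n, F \in \mathcal{N}$ and $F_n \xrightarrow{D} F$. If f(x) is bounded on $R^{(1)}$, 
measurable $\mathcal{B}_1$, and the discontinuity points of f are a set S with F(S) = 0, then 

$$\int f\,dF_n \to \int f\,dF.$$

Remark. The set of discontinuity points of a $\mathcal{B}_1$-measurable function is in 
$\mathcal{B}_1$, so F(S) is well-defined. (See Hobson [72, p. 313].) 


Proof. Take $a, b \in C(F)$, $I_1, \ldots, I_k$ a partition $\mathcal{P}_k$ of I = [a, b), where $I_i = 
[a_i, b_i)$ and $a_i, b_i \in C(F)$. Define on I 

$$\bar{f}_k(x) = \sum_{i=1}^{k} \Big(\sup_{y \in I_i} f(y)\Big)\chi_{I_i}(x), \qquad \underline{f}_k(x) = \sum_{i=1}^{k} \Big(\inf_{y \in I_i} f(y)\Big)\chi_{I_i}(x).$$

Then 

$$\int_I \underline{f}_k\,dF_n \le \int_I f\,dF_n \le \int_I \bar{f}_k\,dF_n.$$

Clearly, the right- and left-hand sides above converge, and 

$$\int_I \underline{f}_k\,dF \le \liminf_n \int_I f\,dF_n \le \limsup_n \int_I f\,dF_n \le \int_I \bar{f}_k\,dF.$$

Let $\|\mathcal{P}_k\| \to 0$. At every point x which is a continuity point of f(x), 

$$\bar{f}_k(x) \downarrow f(x), \qquad \underline{f}_k(x) \uparrow f(x).$$

By the Lebesgue bounded convergence theorem, 

$$\lim_k \int_I \bar{f}_k\,dF = \lim_k \int_I \underline{f}_k\,dF = \int_I f\,dF.$$

Let M = sup |f(x)|, then since $\{F_n\}$ is mass-preserving, for any $\varepsilon > 0$ we can 
take I such that $F_n(I^c) < \varepsilon/2M$ and $F(I^c) < \varepsilon/2M$. Now 

$$\left|\int f\,dF_n - \int f\,dF\right| \le \left|\int_I f\,dF_n - \int_I f\,dF\right| + \varepsilon,$$

so that $\limsup_n \left|\int f\,dF_n - \int f\,dF\right| \le \varepsilon$. 


Corollary 8.13. In the above proposition, eliminate the condition that f be 
bounded on $R^{(1)}$, then 

$$\int |f|\,dF \le \liminf_n \int |f|\,dF_n.$$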


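Proof. Define 

$$g_\alpha(x) = \begin{cases} |f(x)|, & \text{if } |f(x)| \le \alpha, \\ \alpha, & \text{otherwise.} \end{cases}$$

Every continuity point of f is a continuity point of $g_\alpha$. Apply Proposition 
8.12 to $g_\alpha$ to conclude 

$$\int g_\alpha\,dF = \lim_n \int g_\alpha\,dF_n \le \liminf_n \int |f|\,dF_n.$$

Let $\alpha \uparrow \infty$, then $g_\alpha \uparrow |f|$. By the monotone convergence theorem 

$$\int g_\alpha\,dF \to \int |f|\,dF.$$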


Problems 


13. Let $F_n, F \in \mathcal{N}$, $F_n \xrightarrow{D} F$. For any set $E \subset R^{(1)}$, define the boundary of 
E as $\mathrm{bd}(E) = \bar{E} \cap \overline{E^c}$, ($\bar{E}$ = closure of E). Prove that for any $B \in \mathcal{B}_1$ such 
that F(bd(B)) = 0, $F_n(B) \to F(B)$. 

14. Let $F_n \xrightarrow{D} F$, and h(x), g(x) be continuous functions such that 

$$|g(x)| \to +\infty \ \text{ as } x \to \pm\infty, \qquad \left|\frac{h(x)}{g(x)}\right| \to 0 \ \text{ as } x \to \pm\infty.$$

Show that $\limsup_n \int |g(x)|\,dF_n < \infty$ implies 

$$\int h(x)\,dF_n \to \int h(x)\,dF.$$

4. CLASSES OF FUNCTIONS THAT SEPARATE 


Definition 8.14. A set $\mathcal{E}$ of bounded continuous functions on $R^{(1)}$ will be called 
$\mathcal{N}$-separating if for any $F, G \in \mathcal{N}$, 

$$\int f\,dF = \int f\,dG, \quad \text{all } f \in \mathcal{E},$$

implies F = G. 

We make this a bit more general (also ultimately, more convenient) by 
allowing the functions of $\mathcal{E}$ to be complex-valued. That is, we consider 
functions f(x) of the form $f(x) = f_1(x) + i f_2(x)$, $f_1$, $f_2$ real-valued, continuous, 
and bounded. Now, of course, |f(x)| has the meaning of the absolute value 
of a complex number. As usual, then, $\int f\,dF = \int f_1\,dF + i \int f_2\,dF$. 
The nice thing about such a class $\mathcal{E}$ of functions is that we can check 
whether $F_n \xrightarrow{D} F$ by looking at the integrals of these functions. More 
specifically: 

Proposition 8.15. Let $\mathcal{E}$ be $\mathcal{N}$-separating, and $\{F_n\}$ mass-preserving. Then 
there exists an $F \in \mathcal{N}$ such that $F_n \xrightarrow{D} F$ if and only if 

$$\lim_n \int f\,dF_n \ \text{ exists, all } f \in \mathcal{E}.$$

If this holds, then $\lim \int f\,dF_n = \int f\,dF$, all $f \in \mathcal{E}$. 


Proof. One way is clear. If $F_n \xrightarrow{D} F$, then $\lim \int f\,dF_n = \int f\,dF$ by 8.12. 
To go the other way, take any D-convergent subsequence $F_{n_k}$ of $F_n$. By 
mass-preservation $F_{n_k} \xrightarrow{D} F \in \mathcal{N}$. Take any other convergent subsequence 
$F_{n_k'} \xrightarrow{D} G$. Then for $f \in \mathcal{E}$, by 8.12, 

$$\lim_k \int f\,dF_{n_k} = \int f\,dF, \qquad \lim_k \int f\,dF_{n_k'} = \int f\,dG,$$

so $\int f\,dF = \int f\,dG$, all $f \in \mathcal{E}$ $\Rightarrow$ F = G. All D-convergent subsequences of 
$F_n$ have the same limit F, implying $F_n \xrightarrow{D} F$. 

Corollary 8.16. Let $\mathcal{E}$ be $\mathcal{N}$-separating and $\{F_n\}$ mass-preserving. If $F \in \mathcal{N}$ 
is such that $\int f\,dF_n \to \int f\,dF$, all $f \in \mathcal{E}$, then $F_n \xrightarrow{D} F$. 


The relevance of looking at integrals of functions to D-convergence can be 
clarified by the simple observation that $F_n \xrightarrow{D} F$ is equivalent to $\int f\,dF_n \to 
\int f\,dF$ for all functions f of the form 

$$f(x) = \begin{cases} 1, & x < x_0, \\ 0, & x \ge x_0, \end{cases}$$

for any $x_0 \in C(F)$. 
What classes of functions are $\mathcal{N}$-separating? Take $\mathcal{E}_0$ to be all functions 
f of the form below (see Fig. 8.2) with any a, b finite and any $\varepsilon > 0$. 


Proposition 8.17. $\mathcal{E}_0$ is $\mathcal{N}$-separating. 

Proof. For any $F, G \in \mathcal{N}$, take $a, b \in C(F) \cap C(G)$. Assume that for any f 
as described, 

$$\int f\,dF = \int f\,dG.$$

Then 

$$F([a, b)) \le \int f\,dF = \int f\,dG \le G([a - \varepsilon, b + \varepsilon)),$$

and conversely, 

$$G([a, b)) \le F([a - \varepsilon, b + \varepsilon)).$$

Let $\varepsilon \downarrow 0$, to get F([a, b)) = G([a, b)). The foregoing being true for all 
$a, b \in C(F) \cap C(G)$ implies F = G. 
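
The two inequalities in this proof rest on the sandwich $\chi_{[a,b)} \le f \le \chi_{[a-\varepsilon,\, b+\varepsilon)}$. The sketch below builds one such f and checks the sandwich on a grid. The linear ramps are an assumption on my part, since only the marked points a − ε, a, b, b + ε and the height 1 of the figure are needed for the argument.

```python
def trapezoid(a, b, eps):
    """Continuous f with f = 1 on [a, b] and f = 0 outside (a - eps, b + eps).
    The linear interpolation on the two ramps is an assumed shape."""
    def f(x):
        if a <= x <= b:
            return 1.0
        if x <= a - eps or x >= b + eps:
            return 0.0
        return (x - (a - eps)) / eps if x < a else ((b + eps) - x) / eps
    return f

def indicator(lo, hi):
    """Indicator of the half-open interval [lo, hi)."""
    return lambda x: 1.0 if lo <= x < hi else 0.0

def sandwich_holds(a, b, eps, grid):
    f = trapezoid(a, b, eps)
    inner, outer = indicator(a, b), indicator(a - eps, b + eps)
    return all(inner(x) <= f(x) <= outer(x) for x in grid)
```

Integrating the sandwich against F and G gives exactly the two displayed inequalities.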


[Figure 8.2: graph of f(x), of height 1, with the points a − ε, a, b, b + ε marked on the x-axis.] 


However, $\mathcal{E}_0$ is an awkward set of functions to work with. What is really 
more important is 


Proposition 8.18. Let $\mathcal{E}$ be a class of continuous bounded functions on $R^{(1)}$ with 
the property that for any $f_0 \in \mathcal{E}_0$, there exist $f_n \in \mathcal{E}$ such that $\sup_x |f_n(x)| \le M$, 
all n, and $\lim f_n(x) = f_0(x)$ for every $x \in R^{(1)}$. Then $\mathcal{E}$ is $\mathcal{N}$-separating. 


Proof. Let $\int f\,dF = \int f\,dG$, all $f \in \mathcal{E}$. For any $f_0 \in \mathcal{E}_0$, take $f_n \in \mathcal{E}$ converging 
to $f_0$ as in the statement 8.18 above. By the Lebesgue bounded convergence 
theorem 

$$\lim_n \int f_n\,dF = \int f_0\,dF, \qquad \lim_n \int f_n\,dG = \int f_0\,dG,$$

so $\int f_0\,dF = \int f_0\,dG$, all $f_0 \in \mathcal{E}_0$, implying F = G. 


5. TRANSLATION INTO RANDOM-VARIABLE TERMS 


The foregoing is all translatable into random-variable terms. For example: 


i) If $X_n$ are random variables, their distribution functions are mass- 
preserving iff for any $\varepsilon > 0$, there is a finite interval I such that 

$$P(X_n \in I^c) < \varepsilon, \quad \text{all } n.$$

ii) If $|g(x)| \to \infty$ as $x \to \pm\infty$, then the distribution functions of $X_n$ are 
mass-preserving if $\sup_n E|g(X_n)| < \infty$ (Problem 9). 

iii) If $X_n$ have mass-preserving distribution functions and $\mathcal{E}$ is an $\mathcal{N}$- 
separating set of functions, then there exists a random variable X such 
that $X_n \xrightarrow{D} X$ if and only if $\lim E f(X_n)$ exists, all $f \in \mathcal{E}$. 


We will switch freely between discussion in terms of distribution functions 
and in terms of random variables, depending on which set of terms is more 
illuminating. 


Proposition 8.19. Let $X_n \xrightarrow{D} X$, and let $\varphi(x)$ be measurable $\mathcal{B}_1$, with its 
set S of discontinuities such that $P(X \in S) = 0$. Then 

$$\varphi(X_n) \xrightarrow{D} \varphi(X).$$

Proof. Let $Z_n = \varphi(X_n)$, $Z = \varphi(X)$. If $E f(Z_n) \to E f(Z)$, for all $f \in \mathcal{E}_0$, then 
$Z_n \xrightarrow{D} Z$. Let $g(x) = f(\varphi(x))$. This function g is bounded, measurable 
$\mathcal{B}_1$, and continuous wherever $\varphi$ is continuous. By 8.12, $E g(X_n) \to E g(X)$. 


We can’t do any better with a.s. convergence. This is illustrated by the 
following problem. 


Problem 15. If $\varphi(x)$ is as in 8.19, and $X_n \xrightarrow{a.s.} X$, then show $\varphi(X_n) \xrightarrow{a.s.} \varphi(X)$. 
Give an example to show that in general this is not true if $\varphi(x)$ is only 
assumed measurable. 


6. AN APPLICATION OF THE FOREGOING 


With only this scanty background we are already in a position to prove a 
more general version of the central limit theorem. To do this we work with 
the class of functions defined by 


$\mathcal{E}_1$ consists of all continuous bounded f on $R^{(1)}$ such that f″(x) exists for all 
x, $\sup_x |f''(x)| < \infty$, and f″(x) is uniformly continuous on $R^{(1)}$. 


It is fairly obvious that $\mathcal{E}_1$ satisfies the requirements of 8.18 and hence is 
$\mathcal{N}$-separating. We use $\mathcal{E}_1$ to establish a simple example of what has become 
known as the “invariance principle.” 


Theorem 8.20. If there is one sequence $X_1^*, X_2^*, \ldots$ of independent, identically
distributed random variables, $EX_1^* = 0$, $E(X_1^*)^2 = \sigma^{*2} < \infty$, such that

$$\frac{X_1^* + \cdots + X_n^*}{\sigma^* \sqrt{n}} \xrightarrow{\mathcal{D}} X,$$

then for all sequences $X_1, X_2, \ldots$ of independent, identically distributed random
variables such that $EX_1 = 0$, $EX_1^2 = \sigma^2 < \infty$,

$$\frac{X_1 + \cdots + X_n}{\sigma \sqrt{n}} \xrightarrow{\mathcal{D}} X.$$

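Proof. Let $f \in \mathcal{E}_2$, and define

$$\delta(h) = \sup_{|x - y| \le h} |f''(x) - f''(y)|.$$

By definition $\lim \delta(h) = 0$ as $h \downarrow 0$. We may as well assume $\sigma^* = \sigma = 1$,
otherwise we use $X_k^*/\sigma^*$, $X_k/\sigma$. Let

$$Z_n = \frac{X_1 + \cdots + X_n}{\sqrt{n}}, \qquad Z_n^* = \frac{X_1^* + \cdots + X_n^*}{\sqrt{n}}.$$

Since $EZ_n^2 = 1$, $E(Z_n^*)^2 = 1$, both sequences are mass-preserving. By 8.15

$$Ef(Z_n^*) \to Ef(X), \quad \text{all } f \in \mathcal{E}_2,$$

and by 8.16 it suffices to show that

$$Ef(Z_n) \to Ef(X), \quad \text{all } f \in \mathcal{E}_2.$$

Since only the distributions are relevant here, we can assume that the $X_k^*$, $X_k$
are defined on a common sample space and are independent of each other.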
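Now write

$$
\begin{aligned}
f(Z_n) - f(Z_n^*) ={}& f\left(\frac{X_1 + \cdots + X_{n-1} + X_n}{\sqrt{n}}\right) - f\left(\frac{X_1 + \cdots + X_{n-1} + X_n^*}{\sqrt{n}}\right) \\
&+ f\left(\frac{X_1 + \cdots + X_{n-1} + X_n^*}{\sqrt{n}}\right) - f\left(\frac{X_1 + \cdots + X_{n-2} + X_{n-1}^* + X_n^*}{\sqrt{n}}\right) \\
&+ \cdots \\
&+ f\left(\frac{X_1 + X_2^* + \cdots + X_n^*}{\sqrt{n}}\right) - f\left(\frac{X_1^* + X_2^* + \cdots + X_n^*}{\sqrt{n}}\right).
\end{aligned}
$$

Define random variables $V_k$, $U_k$ by

$$
V_k = f\left(\frac{X_1 + \cdots + X_{k-1} + X_k + X_{k+1}^* + \cdots + X_n^*}{\sqrt{n}}\right)
    - f\left(\frac{X_1 + \cdots + X_{k-1} + X_k^* + X_{k+1}^* + \cdots + X_n^*}{\sqrt{n}}\right),
$$

$$
U_k = \frac{X_1 + \cdots + X_{k-1} + X_{k+1}^* + \cdots + X_n^*}{\sqrt{n}}.
$$

Use Taylor’s expansion around $U_k$ to get

$$
V_k = \frac{X_k - X_k^*}{\sqrt{n}}\, f'(U_k)
    + \frac{X_k^2}{2n}\, f''\left(U_k + \theta \frac{X_k}{\sqrt{n}}\right)
    - \frac{(X_k^*)^2}{2n}\, f''\left(U_k + \theta^* \frac{X_k^*}{\sqrt{n}}\right),
\tag{8.21}
$$

where $\theta$, $\theta^*$ are random variables such that $0 \le \theta, \theta^* \le 1$. Both $X_k$, $X_k^*$ are
independent of $U_k$, so

$$E\big((X_k - X_k^*) f'(U_k)\big) = 0, \qquad \big(|f'(x)| \le c_1 + c_2 |x| \Rightarrow E|f'(U_k)| < \infty\big),$$

$$E\left(X_k^2 f''\left(U_k + \theta \frac{X_k}{\sqrt{n}}\right)\right) = (EX_k^2)(Ef''(U_k)) + \alpha E\left(X_k^2\, \delta\left(\frac{|X_k|}{\sqrt{n}}\right)\right), \qquad |\alpha| \le 1.$$

Let $h_n(x) = x^2 \delta(|x|/\sqrt{n})$. Take the expectation of (8.21) and use $EX_k^2 = E(X_k^*)^2$ to get

$$|EV_k| \le (1/2n)[Eh_n(X_k) + Eh_n(X_k^*)] = (1/2n)[Eh_n(X_1) + Eh_n(X_1^*)],$$

this latter by the identical distribution. Note that

$$f(Z_n) - f(Z_n^*) = V_1 + \cdots + V_n,$$

so

$$|Ef(Z_n) - Ef(Z_n^*)| \le \tfrac{1}{2} Eh_n(X_1) + \tfrac{1}{2} Eh_n(X_1^*).$$

Let $M = \sup_x |f''(x)|$; then $\delta(h) \le 2M$, all $h$, so $h_n(X_1) \le 2M X_1^2$. But
$h_n(X_1) \to 0$ a.s. Since $X_1^2$ is integrable, the bounded convergence theorem
yields $Eh_n(X_1) \to 0$. Similarly for $Eh_n(X_1^*)$. Thus, it has been established
that

$$|Ef(Z_n) - Ef(Z_n^*)| \to 0,$$

implying

$$Ef(Z_n) \to Ef(X). \qquad \text{Q.E.D.}$$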


This proof is anachronistic in the sense that there are much simpler
methods of proving the central limit theorem if one knows some more
probability theory. But it is an interesting proof. We know that if we take
$X_1^*, X_2^*, \ldots$ to be fair coin-tossing variables, that

$$\frac{X_1^* + \cdots + X_n^*}{\sqrt{n}} \xrightarrow{\mathcal{D}} \mathcal{N}(0, 1),$$

where the notation $\mathcal{N}(0, 1)$ is clarified by


Definition 8.22. The normal distribution with mean $\mu$ and variance $\sigma^2$, denoted
$\mathcal{N}(\mu, \sigma^2)$, is the distribution of a random variable $\sigma X + \mu$, where

$$P(X < x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt.$$

So we have proved

Corollary 8.23. Let $X_1, X_2, \ldots$ be independent, identically distributed random
variables, $EX_1 = 0$, $EX_1^2 = \sigma^2 < \infty$. Then

$$\frac{X_1 + \cdots + X_n}{\sigma \sqrt{n}} \xrightarrow{\mathcal{D}} \mathcal{N}(0, 1).$$
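
The invariance idea behind Theorem 8.20 can be checked numerically without any simulation. The sketch below is an illustration only: it takes the test function $f(x) = \cos x$ (bounded, with a uniformly continuous second derivative) and compares $Ef(Z_n)$ for two summand laws with mean 0 and variance 1. The choice of laws (fair $\pm 1$ coin and uniform on $[-\sqrt{3}, \sqrt{3}]$) and the function names are assumptions made for the example; the closed forms used are the standard characteristic functions of those two laws.

```python
import math


def coin_expect_cos(n: int, u: float = 1.0) -> float:
    """E cos(u Z_n) for Z_n = (X_1+...+X_n)/sqrt(n), X_k = +1 or -1 with probability 1/2.

    The characteristic function of one coin is cos(u), and it is real, so
    E cos(u Z_n) = cos(u / sqrt(n)) ** n.
    """
    return math.cos(u / math.sqrt(n)) ** n


def uniform_expect_cos(n: int, u: float = 1.0) -> float:
    """Same expectation when X_k is uniform on [-sqrt(3), sqrt(3)] (mean 0, variance 1).

    The characteristic function of that uniform law is sin(sqrt(3) u) / (sqrt(3) u).
    """
    t = math.sqrt(3.0) * u / math.sqrt(n)
    return (math.sin(t) / t) ** n


if __name__ == "__main__":
    limit = math.exp(-0.5)  # E cos(X) for X distributed N(0, 1)
    for n in (10, 100, 1000, 10000):
        a, b = coin_expect_cos(n), uniform_expect_cos(n)
        print(f"n={n:6d}  coin={a:.6f}  uniform={b:.6f}  gap={abs(a - b):.2e}  limit={limit:.6f}")
```

The gap between the two columns shrinks as $n$ grows, which is exactly the statement $|Ef(Z_n) - Ef(Z_n^*)| \to 0$ for this particular $f$.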


7. CHARACTERISTIC FUNCTIONS 
AND THE CONTINUITY THEOREM 


The class of functions of the form $\{e^{iux}\}$, $u \in R^{(1)}$, is particularly important
and useful in studying convergence in distribution. To begin with


Theorem 8.24. The set of all complex exponentials $\{e^{iux}\}$, $u \in R^{(1)}$, is
N-separating.


Proof. Let $\int e^{iux}\, dF = \int e^{iux}\, dG$, all $u$. Then for $\alpha_k$, $k = 1, \ldots, m$ any
complex numbers, and $u_1, \ldots, u_m$ real,

$$\int \Big(\sum_k \alpha_k e^{i u_k x}\Big)\, dF = \int \Big(\sum_k \alpha_k e^{i u_k x}\Big)\, dG. \tag{8.25}$$

Let $f_0$ be in $\mathcal{E}_0$, let $\epsilon_n \downarrow 0$, $\epsilon_n \le 1$, and consider the interval $[-n, +n]$. Any
continuous function on $[-n, +n]$ equal at endpoints can be uniformly
approximated by a trigonometric polynomial; that is, there exists a finite sum

$$f_n(x) = \sum_k \alpha_k^{(n)} e^{i \pi k x / n}$$

such that $|f_n(x) - f_0(x)| \le \epsilon_n$, $x \in [-n, +n]$. Since $f_n$ is periodic, and $\epsilon_n \le 1$,
then for all $n$, $\sup_x |f_n(x)| \le 2$. By (8.25) above $\int f_n\, dF = \int f_n\, dG$. This
gives $\int f_0\, dF = \int f_0\, dG$ or $F = G$.


Definition 8.26. Given a distribution function $F(x)$, its characteristic function
$f(u)$ is a complex-valued function defined on $R^{(1)}$ by

$$f(u) = \int e^{iux} F(dx).$$

If $F$ is the distribution function of the random variable $X$, then equivalently,

$$f(u) = E e^{iuX}.$$

Note quickly that


Proposition 8.27. Any characteristic function $f(u)$ has the properties

i) $f(0) = 1$,
ii) $|f(u)| \le 1$,
iii) $f(u)$ is uniformly continuous on $R^{(1)}$,
iv) $f(-u) = \overline{f(u)}$.

Proof

i) Obvious;

ii) $\left| \int e^{iux}\, dF \right| \le \int |e^{iux}|\, dF = 1$;

iii) $|f(u + h) - f(u)| = \left| \int (e^{i(u+h)x} - e^{iux})\, dF \right| \le \int |e^{ihx} - 1|\, dF = \delta(h)$,
by the bounded convergence theorem $\delta(h) \to 0$ as $h \to 0$;

iv) $f(-u) = \overline{f(u)}$ is obvious.
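
As a quick sanity check of Proposition 8.27, the following sketch evaluates $f(u) = E e^{iuX}$ for a small discrete law and verifies properties (i), (ii), and (iv) numerically. The three-point distribution `PMF` and the helper name `char_fn` are hypothetical, chosen only for illustration.

```python
import cmath


def char_fn(pmf: dict, u: float) -> complex:
    """f(u) = E exp(iuX) for a discrete X with P(X = x) = pmf[x]."""
    return sum(p * cmath.exp(1j * u * x) for x, p in pmf.items())


# Hypothetical example law (not from the text): an asymmetric three-point distribution.
PMF = {-1.0: 0.2, 0.5: 0.5, 3.0: 0.3}

if __name__ == "__main__":
    print("f(0) =", char_fn(PMF, 0.0))
    for u in (0.7, 2.0, 10.0):
        f = char_fn(PMF, u)
        # |f(u)| should not exceed 1, and f(-u) should be the complex conjugate of f(u).
        print(u, abs(f), abs(char_fn(PMF, -u) - f.conjugate()))
```

Because this law is not symmetric, $f(u)$ is genuinely complex, so the conjugate symmetry in (iv) is not trivial here.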


Theorem 8.24 may be stated as: No two distinct distribution functions
have the same characteristic function. However, examples are known (see
Loéve [108, p. 218]) of distribution functions $F_1 \ne F_2$ such that $f_1(u) = f_2(u)$
for all $u$ in the interval $[-1, +1]$. Consequently, the set of functions $\{e^{iux}\}$,
$-1 \le u \le +1$, is not N-separating.

The condition that $F_n \xrightarrow{\mathcal{D}} F$ can be elegantly stated in terms of the
associated characteristic functions.


Theorem 8.28 (The continuity theorem). If $F_n$ are distribution functions with
characteristic functions $f_n(u)$ such that

a) $\lim_n f_n(u)$ exists for every $u$, and
b) $\lim_n f_n(u) = h(u)$ is continuous at $u = 0$,

then there is a distribution function $F$ such that $F_n \xrightarrow{\mathcal{D}} F$ and $h(u)$ is the
characteristic function of $F$.


Proof. Since $\{e^{iux}\}$ is N-separating and $\lim_n \int e^{iux}\, dF_n$ exists for every
member of $\{e^{iux}\}$, by 8.15, all we need to do is show that $\{F_n\}$ is
mass-preserving. To do this, we need


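Proposition 8.29. There exists a constant $\alpha$, $0 < \alpha < \infty$, such that for any
distribution function $F$ with characteristic function $f$, and any $u > 0$,

$$F\left(\left[-\frac{1}{u}, +\frac{1}{u}\right]^c\right) \le \frac{\alpha}{u} \int_0^u (1 - \operatorname{Rl} f(v))\, dv.$$

Proof. $\operatorname{Rl} f(u) = \int \cos ux\, F(dx)$, so

$$
\begin{aligned}
\frac{1}{u} \int_0^u [1 - \operatorname{Rl} f(v)]\, dv
&= \frac{1}{u} \int_0^u \int (1 - \cos vx)\, F(dx)\, dv \\
&= \int \left[ \frac{1}{u} \int_0^u (1 - \cos vx)\, dv \right] F(dx) \\
&= \int \left( 1 - \frac{\sin ux}{ux} \right) F(dx) \\
&\ge \inf_{|t| \ge 1} \left( 1 - \frac{\sin t}{t} \right) \int_{|ux| > 1} F(dx).
\end{aligned}
$$

The integrand in the third line is nonnegative, which justifies restricting to $|ux| > 1$
in the last step. Taking $1/\alpha = \inf_{|t| \ge 1} (1 - \sin t / t) > 0$ gives the inequality.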
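
The inequality of Proposition 8.29 can be checked numerically for a law whose characteristic function is known in closed form. The sketch below does this for $\mathcal{N}(0, 1)$, using the standard fact that its characteristic function is $e^{-v^2/2}$. The value `ALPHA` $= 1/(1 - \sin 1)$ is an assumption: it is one admissible constant (the infimum of $1 - \sin t/t$ over $|t| \ge 1$ is attained at $t = 1$), not necessarily the constant intended in the text. Function names are illustrative.

```python
import math

# Assumed constant: 1 / inf_{|t| >= 1} (1 - sin(t)/t) = 1 / (1 - sin 1).
ALPHA = 1.0 / (1.0 - math.sin(1.0))


def tail_mass(u: float) -> float:
    """F([-1/u, 1/u]^c) = P(|X| > 1/u) for X distributed N(0, 1)."""
    return math.erfc(1.0 / (u * math.sqrt(2.0)))


def bound(u: float, steps: int = 2000) -> float:
    """(ALPHA / u) * integral_0^u (1 - exp(-v^2 / 2)) dv, by the midpoint rule."""
    h = u / steps
    integral = sum(1.0 - math.exp(-(((j + 0.5) * h) ** 2) / 2.0) for j in range(steps)) * h
    return ALPHA / u * integral


if __name__ == "__main__":
    for u in (0.1, 0.5, 1.0, 2.0, 5.0, 20.0):
        print(f"u={u:5.1f}  tail={tail_mass(u):.6f}  bound={bound(u):.6f}")
```

For small $u$ the bound behaves like $\alpha u^2/6$ while the normal tail is far smaller, so the inequality is loose there; it is the uniformity over all distributions that matters for the continuity theorem.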


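Now back to the main theorem. By the above inequality,

$$\limsup_n F_n\left(\left[-\frac{1}{u}, +\frac{1}{u}\right]^c\right) \le \frac{\alpha}{u} \limsup_n \int_0^u [1 - \operatorname{Rl} f_n(v)]\, dv.$$

The bounded convergence theorem gives

$$\limsup_n F_n\left(\left[-\frac{1}{u}, +\frac{1}{u}\right]^c\right) \le \frac{\alpha}{u} \int_0^u (1 - \operatorname{Rl} h(v))\, dv.$$

Now $f_n(0) = 1 \Rightarrow h(0) = 1$. By continuity of $h$ at zero, $\lim_{v \to 0} \operatorname{Rl} h(v) = 1$.
Therefore,

$$\lim_{u \downarrow 0} \limsup_n F_n\left(\left[-\frac{1}{u}, +\frac{1}{u}\right]^c\right) = 0.$$

By this, for any $\epsilon > 0$ we may take $a$ so that $\limsup_n F_n([-a, +a]^c) < \epsilon/2$.
So there is an $n_0$ such that for $n > n_0$, $F_n([-a, +a]^c) < \epsilon$. Take $b > a$
such that $F_k([-b, +b]^c) < \epsilon$ for $k = 1, 2, \ldots, n_0$. From these together
$\sup_n F_n([-b, +b]^c) < \epsilon$. Q.E.D.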


Corollary 8.30. Let $F_n$ be distribution functions, $f_n$ their characteristic functions.
If there is a distribution function $F$ with characteristic function $f$ such that
$\lim_n f_n(u) = f(u)$ for every $u$, then $F_n \xrightarrow{\mathcal{D}} F$.


Proof. Obvious from 8.28.

Clearly, if $F_n \xrightarrow{\mathcal{D}} F$, then the characteristic functions $f_n(u)$ converge at
every point $u$ to $f(u)$. We strengthen this to


Proposition 8.31. If $F_n \xrightarrow{\mathcal{D}} F$, then the corresponding characteristic functions
$f_n(u)$ converge uniformly to $f(u)$ on every finite interval $I$. (Denote this kind
of convergence by $\xrightarrow{uc}$.)


Proof. This result follows from the fact that the $f_n$, $f$ form an equicontinuous
family. That is, if we fix a finite interval $I$, then for any $n$, $u$, and $h$,

$$
\begin{aligned}
|f_n(u + h) - f_n(u)| &\le \left| \int_I (e^{i(u+h)x} - e^{iux})\, dF_n(x) \right| + 2F_n(I^c) \\
&\le \sup_{x \in I} |e^{ihx} - 1| + 2F_n(I^c).
\end{aligned}
$$

Thus, since the $\{F_n\}$ are mass-preserving,

$$\sup_{n, u} |f_n(u + h) - f_n(u)| \le \delta(h),$$

where $\delta(h) \downarrow 0$ as $h \downarrow 0$. Now the usual argument works: Divide $I$ up into
points $u_1, u_2, \ldots, u_m$ such that $|u_{k+1} - u_k| \le h$. For $u \in I$,

$$|f_n(u) - f(u)| \le |f_n(u) - f_n(u_k)| + |f_n(u_k) - f(u_k)| + |f(u_k) - f(u)|,$$

where $u_k$ is the point of the partition nearest $u$. Therefore

$$\limsup_n \sup_{u \in I} |f_n(u) - f(u)| \le 2\delta(h),$$

because $f(u)$ also satisfies $|f(u + h) - f(u)| \le \delta(h)$. Taking $h \to 0$ now
completes the proof.

The continuity theorem gives us a strong basic tool. Now we start 
reaping limit theorems from it by using some additional technical details. 


Problems 


16. A random variable $X$ has a symmetric distribution if $P(X \in B) =
P(X \in -B)$, where $-B = \{-x;\ x \in B\}$. Prove that the characteristic
function of $X$ is real for all $u$ iff $X$ has a symmetric distribution.


17. A natural question is, what continuous complex-valued functions $f(u)$
on $R^{(1)}$ are characteristic functions? Say that such a function is nonnegative
definite if for any complex numbers $\lambda_1, \ldots, \lambda_n$, and points $u_1, \ldots, u_n \in R^{(1)}$,

$$\sum_{i,j=1}^{n} f(u_i - u_j) \lambda_i \bar{\lambda}_j \ge 0.$$

A complete answer to the question is given by the following theorem.

Bochner’s Theorem. Let $f(u)$ be continuous on $R^{(1)}$, $f(0) = 1$. Then $f$ is a
characteristic function if and only if it is nonnegative definite.

Prove that if $f$ is a characteristic function, then it is nonnegative definite.
(See Loéve [108, pp. 207 ff.] for a proof of the other direction.)

18. Find the characteristic function for a Poisson distribution with
parameter $\lambda$.

19. Find the characteristic function of $S_n$ for coin-tossing.

20. If $Y = aX + b$, show that

$$f_Y(u) = e^{iub} f_X(au).$$

21. A random variable $X$ is called a displaced lattice random variable if
there are numbers $a$, $d$ such that

$$\sum_{n=-\infty}^{+\infty} P(X = nd + a) = 1.$$

Show that $X$ is a displaced lattice if and only if there is a $u \ne 0$ such that
$|f_X(u)| = 1$. If $u_1$, $u_2$ are irrational with respect to each other, and
$|f_X(u_1)| = |f_X(u_2)| = 1$, show that $X$ is a.s. constant, hence $|f_X(u)| \equiv 1$. Show
that $X$ is distributed on a lattice $L_d$, $d > 0$ iff there is a $u \ne 0$ such that
$f_X(u) = 1$.

8. THE CONVERGENCE OF TYPES THEOREM 


Look at the question: Suppose that $X_n \xrightarrow{\mathcal{D}} X$, and $X$ is nondegenerate. Can
we find constants $a_n$, $b_n$ such that $a_n X_n + b_n \xrightarrow{\mathcal{D}} X'$ where $X'$ has a law not
connected with that of $X$ in any reasonable way? For example, if $X_1, X_2, \ldots$
are independent and identically distributed, $EX_1 = 0$, $EX_1^2 < \infty$, can we find
constants $A_n$ such that $S_n/A_n$ D-converges to something not $\mathcal{N}(\mu, \sigma^2)$? And
if $S_n/A_n$ D-converges, what can be said about the size of $A_n$ compared with
$\sqrt{n}$, the norming factor we have been using? Clearly, we cannot get the
result that $X_n \xrightarrow{\mathcal{D}} X$, $a_n X_n + b_n \xrightarrow{\mathcal{D}} X'$ implies $\lim a_n = a$ exists, because
if $X_n$ has a symmetric distribution, then $X_n \xrightarrow{\mathcal{D}} X \Rightarrow (-1)^n X_n \xrightarrow{\mathcal{D}} X$,
since $-X_n$ and $X_n$ have the same law. But if we rule this out by requiring
$a_n > 0$, then the kind of result we want holds.


Theorem 8.32 (Convergence of types theorem). Let $X_n \xrightarrow{\mathcal{D}} X$, and suppose
there are constants $a_n > 0$, $b_n$ such that $a_n X_n + b_n \xrightarrow{\mathcal{D}} X'$, where $X$ and $X'$
are nondegenerate. Then there are constants $a$, $b$ such that $\mathcal{L}(X') = \mathcal{L}(aX + b)$
and $b_n \to b$, $a_n \to a$.


Proof. Use characteristic functions and let $f_n = f_{X_n}$, so that

$$f_{a_n X_n + b_n}(u) = e^{iub_n} f_n(a_n u).$$

By 8.31, if $f'$, $f$ are the characteristic functions of $X'$, $X$ respectively, then

$$e^{iub_n} f_n(a_n u) \xrightarrow{uc} f'(u), \qquad f_n(u) \xrightarrow{uc} f(u).$$

Take $n'$ such that $a_{n'} \to a$, where $a$ may be infinite. Since

$$|f_{n'}(a_{n'} u)| \xrightarrow{uc} |f'(u)|,$$

if $a_{n'} \to \infty$, substitute $v_{n'} = u/a_{n'}$, $u \in I$, to get

$$|f_{n'}(u)| = |f_{n'}(a_{n'} v_{n'})| \to |f'(0)| = 1.$$

Thus $|f(u)| \equiv 1$, implying $X$ degenerate by Problem 21. Hence $a$ is finite.
Using uc-convergence $|f_{n'}(a_{n'} u)| \to |f(au)|$; thus $|f'(u)| = |f(au)|$. Suppose
$a_{n'} \to a$, $a_{n''} \to a'$ and $a \ne a'$. Use $|f(au)| = |f(a'u)|$, assume $a' < a$, so
$|f(u)| = |f((a'/a)u)| = \cdots = |f((a'/a)^N u)|$ by iterating $N$ times. Let
$N \to \infty$ to get the contradiction $|f(u)| \equiv 1$. Thus there is a unique $a > 0$
such that $a_n \to a$. So $f_n(a_n u) \xrightarrow{uc} f(au)$. Hence $e^{iub_n}$ must converge for every
$u$ such that $f(au) \ne 0$, thus in some interval $|u| < \delta$. Obviously then,
$\limsup |b_n| < \infty$, and if $b$, $b'$ are two limit-points of $b_n$, then $e^{iub} = e^{iub'}$ for all
$|u| < \delta$, which implies $b = b'$. Thus $b_n \to b$, $e^{iub_n} \to e^{iub}$, and $f'(u) = e^{iub} f(au)$.


9. CHARACTERISTIC FUNCTIONS AND INDEPENDENCE 


The part that is really important and makes the use of characteristic functions 
so natural is the multiplicative property of the complex exponentials 
and the way that this property fits in with the independence of random 
variables. 

Proposition 8.33. Let $X_1, X_2, \ldots, X_n$ be random variables with characteristic
functions $f_1(u), \ldots, f_n(u)$. The random variables are independent iff for all
$u_1, \ldots, u_n$,

$$E\left(\exp\left[i \sum_1^n u_k X_k\right]\right) = \prod_1^n f_k(u_k).$$

Proof. Suppose $X$, $Y$ are independent random variables and $f$, $g$ are complex-
valued measurable functions, $f = f_1 + i f_2$, $g = g_1 + i g_2$, and $f_1, f_2, g_1, g_2$ are
$\mathcal{B}_1$-measurable. Then I assert that if $E|f(X)| < \infty$, $E|g(Y)| < \infty$,

$$E f(X) g(Y) = (Ef(X))(Eg(Y)), \tag{8.34}$$

so splitting into products does carry over to complex-valued functions.
To show this, just verify

$$
\begin{aligned}
E(f_1(X) + i f_2(X))(g_1(Y) + i g_2(Y))
={}& E f_1(X) g_1(Y) + i E f_2(X) g_1(Y) \\
&+ i E f_1(X) g_2(Y) - E f_2(X) g_2(Y).
\end{aligned}
$$

All the expectations are those of real-valued functions. We apply the ordinary
result to each one and get (8.34). Thus, inducing up to $n$ variables, conclude
that if $X_1, X_2, \ldots, X_n$ are independent, then

$$E\left(\prod_1^n e^{i u_k X_k}\right) = \prod_1^n E e^{i u_k X_k}.$$

To go the other way, we make use of a result which will be proved in Chapter
11. If we consider the set of functions on $R^{(n)}$,

$$\{e^{i(u_1 x_1 + \cdots + u_n x_n)}\}, \quad \text{all } (u_1, \ldots, u_n),$$

then these separate the $n$-dimensional distribution functions. Let $F_n(x)$ be
the distribution function of $X_1, \ldots, X_n$; then the left-hand side of the
equation in Proposition 8.33 is simply

$$\int \exp[i(u_1 x_1 + \cdots + u_n x_n)]\, F_n(dx).$$

But the right-hand side is the integral of

$$e^{i(u_1 x_1 + \cdots + u_n x_n)}$$

with respect to the distribution function $\prod_1^n F_k(x_k)$. Hence $F_n(x_1, \ldots, x_n) =
\prod_1^n F_k(x_k)$, thus establishing independence.


Notation. To keep various variables and characteristic functions clear, we
denote by $f_X(u)$ the characteristic function of the random variable $X$.


Corollary 8.35. If $X_1, \ldots, X_n$ are independent, then the characteristic
function of $S_n = X_1 + \cdots + X_n$ is given by

$$f_{S_n}(u) = \prod_1^n f_{X_k}(u).$$


The proof is obvious. 
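
Corollary 8.35 is easy to confirm by brute force for discrete laws: convolve the two probability mass functions, compute the characteristic function of the result, and compare it with the product of the individual characteristic functions. The sketch below does this for a fair die and a fair coin; these example laws and the helper names are illustrative assumptions, not taken from the text.

```python
import cmath
from collections import defaultdict


def char_fn(pmf: dict, u: float) -> complex:
    """f(u) = E exp(iuX) for a discrete X with P(X = x) = pmf[x]."""
    return sum(p * cmath.exp(1j * u * x) for x, p in pmf.items())


def convolve(pmf_a: dict, pmf_b: dict) -> dict:
    """Law of X + Y for independent X and Y with the given discrete laws."""
    out = defaultdict(float)
    for x, p in pmf_a.items():
        for y, q in pmf_b.items():
            out[x + y] += p * q
    return dict(out)


DIE = {k: 1.0 / 6.0 for k in range(1, 7)}  # fair die (illustrative)
COIN = {0: 0.5, 1: 0.5}                    # fair coin (illustrative)

if __name__ == "__main__":
    total = convolve(DIE, COIN)
    for u in (0.3, 1.0, 2.5):
        lhs = char_fn(total, u)
        rhs = char_fn(DIE, u) * char_fn(COIN, u)
        print(u, abs(lhs - rhs))
```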

See that $X_1$, $X_2$ independent implies that $E e^{iu(X_1 + X_2)} = f_{X_1}(u) f_{X_2}(u)$.
But having this hold for all $u$ is not sufficient to guarantee that $X_1$, $X_2$ are
independent. (See Loéve [108, p. 263, Example 1].)


Recall that in Chapter 3, we got the result that if $X_1, X_2, \ldots$ are
independent, $\sum_1^n X_k$ converges a.s. iff $\sum_1^n X_k$ converges in probability, hence
iff $\sum_m^n X_k \xrightarrow{P} 0$. The one obvious time that $\xrightarrow{P}$ and $\xrightarrow{\mathcal{D}}$ coincide is when
$Y_n \xrightarrow{P} c \Leftrightarrow Y_n \xrightarrow{\mathcal{D}} \mathcal{L}$(degenerate at $c$). This observation will lead to


Proposition 8.36. For $X_1, X_2, \ldots$ independent, $\sum_1^n X_k \xrightarrow{\text{a.s.}}$ iff $\sum_1^n X_k \xrightarrow{\mathcal{D}}$,

because for degenerate convergence, we can prove the following proposition.


Proposition 8.37. If $Y_n$ are random variables with characteristic functions
$f_n(u)$, then $Y_n \xrightarrow{P} 0$ iff $f_n(u) \to 1$ in some neighborhood of the origin.


Proof. One way is obvious: $Y_n \xrightarrow{P} 0$ implies $f_n(u) \to 1$ for all $u$. Now
let $f_n(u) \to 1$ in $[-\delta, +\delta]$. Proposition 8.29 gives

$$F_n\left(\left[-\frac{1}{\delta}, +\frac{1}{\delta}\right]^c\right) \le \frac{\alpha}{\delta} \int_0^{\delta} (1 - \operatorname{Rl} f_n(v))\, dv.$$

The right-hand side goes to zero as $n \to \infty$, so the $F_n$ are mass-preserving.
Let $n'$ be any subsequence such that $F_{n'} \xrightarrow{\mathcal{D}} F$. Then the characteristic
function of $F$ is identically one in $[-\delta, +\delta]$, hence $F$ is degenerate at zero.
By 8.7, the full sequence $F_n$ converges to the law degenerate at zero.

This gives a criterion for convergence based on characteristic functions.
Use the notation $f_k(u) = f_{X_k}(u)$.


Theorem 8.38. $\sum_1^n X_k \xrightarrow{\text{a.s.}}$ iff $\prod_1^n f_k(u)$ converges to $h(u)$ in some neighborhood
$N$ of the origin, and $|h(u)| > 0$ on $N$.


Proof. Certainly $\sum_1^n X_k \xrightarrow{\text{a.s.}}$ implies $\prod_1^n f_k(u)$ converges everywhere to a
characteristic function. To go the other way, the characteristic function
of $\sum_m^n X_k$ is $\prod_m^n f_k(u)$. Because $\prod_1^n f_k(u) \to h(u) \ne 0$ on $N$, $\prod_m^n f_k(u) \to 1$ on $N$.
Use 8.37 to complete the proof, and note that 8.36 is a corollary.
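
To see the criterion of Theorem 8.38 at work, consider sums $\sum c_k Y_k$ with $Y_k = \pm 1$ with probability $\tfrac12$ each, so that the partial sums have characteristic function $\prod_{k \le n} \cos(c_k u)$. The sketch below compares a square-summable choice $c_k = 1/k$ with a non-square-summable choice $c_k = 1/\sqrt{k}$ at the single point $u = 1$. This is an illustration, not a proof: the theorem requires convergence to a nonzero limit on a whole neighborhood of the origin, and the coefficient sequences are assumptions chosen for the example.

```python
import math


def partial_product(c, u: float, n: int) -> float:
    """prod_{k=1}^n cos(c(k) * u), the characteristic function at u of sum_{k<=n} c(k) Y_k."""
    prod = 1.0
    for k in range(1, n + 1):
        prod *= math.cos(c(k) * u)
    return prod


def square_summable(k: int) -> float:
    return 1.0 / k


def not_square_summable(k: int) -> float:
    return 1.0 / math.sqrt(k)


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, partial_product(square_summable, 1.0, n), partial_product(not_square_summable, 1.0, n))
```

In the first column the products settle at a strictly positive value; in the second they drift toward zero, roughly like $n^{-u^2/2}$.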


Problems 


22. For $Y_1, Y_2, \ldots$ independent and $\pm 1$ with probability $\tfrac{1}{2}$, use 8.38 to
show that $\sum c_k Y_k$ converges a.s. $\Leftrightarrow \sum c_k^2 < \infty$.


23. Show that the condition on $f_k(u)$ in 8.38 can be partly replaced by: if
$\sum_1^{\infty} |1 - f_k(u)|$ converges in some neighborhood $N$ of the origin, then
$\sum_1^n X_k \xrightarrow{\text{a.s.}}$.


10. FOURIER INVERSION FORMULAS 


To every characteristic function corresponds one and only one distribution 
function. Sometimes it is useful to know how, given a characteristic function, 
to find the corresponding distribution function, although by far the most 
important facts regarding characteristic functions do not depend on knowing 
how to perform this inversion. The basic inversion formula is the Fourier 
transform inversion formula. There are a lot of different versions of this;
we give one particularly useful version.

Theorem 8.39. Let $f(u)$ be the characteristic function of a distribution function
$F(dx)$ such that

$$\int |f(u)|\, du < \infty.$$

Then $F(dx)$ has a bounded continuous density $h(x)$ with respect to Lebesgue
measure given by

$$h(x) = \frac{1}{2\pi} \int e^{-iux} f(u)\, du. \tag{8.40}$$

Proof. Assume that (8.40) holds true for one distribution function $G(dx)$
with density $g(x)$ and characteristic function $\varphi(u)$. Then we show that it holds
true in general. Write

$$\frac{1}{2\pi} \int e^{-iuy} f(u) \varphi(u)\, du = \frac{1}{2\pi} \iint e^{-iu(y - x)} \varphi(u)\, dF(x)\, du.$$

Then, interchanging order of integration on the right:

$$\frac{1}{2\pi} \int e^{-iuy} f(u) \varphi(u)\, du = \int g(y - x)\, dF(x). \tag{8.41}$$

If $X$ has distribution $F(dx)$, and $Y$ has distribution $G(dx)$, then the integral
on the right is the density for the distribution of $X + Y$ where they are taken
to be independent. Instead of $Y$, now use $\epsilon Y$ in (8.41), because if the
distribution of $Y$ satisfies (8.40), you can easily verify that so does that of $\epsilon Y$,
for $\epsilon$ any real number. As $\epsilon \to 0$ the characteristic function $\varphi_\epsilon(u)$ of $\epsilon Y$
converges to one everywhere. Use the bounded convergence theorem to
conclude that the left-hand side of (8.41) converges to

$$h(y) = \frac{1}{2\pi} \int e^{-iuy} f(u)\, du.$$

The left-hand side is bounded by $\int |f(u)|\, du$ for all $y$, so the integral of the
left-hand side over any finite interval $I$ converges to

$$\int_I h(y)\, dy.$$

If the endpoints of $I$ are continuity points of $F(x)$, then since $\mathcal{L}(X + \epsilon Y) \to
\mathcal{L}(X)$, the right-hand side of (8.41) converges to $F(I)$. Thus the two measures
$F(B)$ and $\int_B h(y)\, dy$ on $\mathcal{B}_1$ agree on all intervals, therefore are identical. The
continuity and boundedness of $h(x)$ follows directly from the expression
(8.40). To conclude, all I have to do is produce one $G(x)$, $\varphi(u)$ for which
(8.40) holds. A convenient pair is

$$g(x) = \tfrac{1}{2} e^{-|x|}, \qquad \varphi(u) = \frac{1}{1 + u^2}. \tag{8.42}$$

To verify (8.42) do a straightforward contour integration.
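
The inversion formula (8.40) can also be checked numerically for the double-exponential pair, $g(x) = \tfrac12 e^{-|x|}$ with characteristic function $1/(1 + u^2)$, which is a standard transform pair. The sketch below is a crude illustration under stated assumptions: it truncates the integral at a finite `cutoff` and uses the trapezoid rule, so it only reproduces the density to about three decimal places. The function name and the numerical parameters are choices made for this example.

```python
import math


def density_by_inversion(x: float, cutoff: float = 1000.0, steps: int = 200000) -> float:
    """Approximate (1 / 2 pi) * integral exp(-iux) / (1 + u^2) du.

    The integrand's imaginary part is odd in u and its real part is even, so the
    integral equals (1 / pi) * integral_0^infinity cos(ux) / (1 + u^2) du.  The
    infinite range is truncated at `cutoff` and the trapezoid rule is applied.
    """
    h = cutoff / steps
    total = 0.5 * (1.0 + math.cos(cutoff * x) / (1.0 + cutoff * cutoff))
    for j in range(1, steps):
        u = j * h
        total += math.cos(u * x) / (1.0 + u * u)
    return total * h / math.pi


if __name__ == "__main__":
    for x in (0.0, 1.0, -2.0):
        print(x, density_by_inversion(x), 0.5 * math.exp(-abs(x)))
```

The slow $1/u^2$ decay of this characteristic function is what limits the accuracy at $x = 0$; the truncation error there is about $1/(\pi \cdot \text{cutoff})$.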


We can use the same method to prove


Proposition 8.43. Let $\varphi_n(u)$ be any sequence of characteristic functions
converging to one for all $u$ such that for each $n$,

$$\int |\varphi_n(u)|\, du < \infty.$$

If $b$ and $a$ are continuity points of any distribution function $F(x)$, with
characteristic function $f(u)$, then

$$F(b) - F(a) = \lim_n \frac{1}{2\pi} \int \left( \frac{e^{-iua} - e^{-iub}}{iu} \right) \varphi_n(u) f(u)\, du.$$

Proof. Whether or not $F$ has a density or $f(u)$ is integrable, (8.41) above
still holds, where now the right-hand side is the density of the distribution of
$X + Y_n$, $X$, $Y_n$ independent, $\varphi_n(u)$ the characteristic function of $Y_n$. Since
$\varphi_n(u) \to 1$, $Y_n \xrightarrow{P} 0$, $\mathcal{L}(X + Y_n) \to \mathcal{L}(X)$. The integral of the right-hand side
over $[a, b)$ thus converges to $F(b) - F(a)$. The integral of the left-hand side
is

$$\frac{1}{2\pi} \int \left( \frac{e^{-iua} - e^{-iub}}{iu} \right) \varphi_n(u) f(u)\, du.$$

This all becomes much simpler if $X$ is distributed on the lattice $L_d$,
$d > 0$. Then

$$f(u) = \sum_{n=-\infty}^{\infty} e^{iund} P(X = nd),$$

so that $f(u)$ has period $2\pi/d$. The inversion formula is simply

$$P(X = nd) = \frac{d}{2\pi} \int_{-\pi/d}^{+\pi/d} e^{-iund} f(u)\, du.$$
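
The lattice inversion formula lends itself to a direct numerical check. The sketch below recovers Poisson probabilities from the Poisson characteristic function $\exp[\lambda(e^{iu} - 1)]$ with $d = 1$, using the midpoint rule over one full period. The helper names, the choice $\lambda = 2$, and the number of quadrature nodes are assumptions made for the example.

```python
import cmath
import math


def poisson_cf(u: float, lam: float) -> complex:
    """Characteristic function exp(lam * (exp(iu) - 1)) of a Poisson law with parameter lam."""
    return cmath.exp(lam * (cmath.exp(1j * u) - 1.0))


def lattice_inversion(cf, n: int, d: float = 1.0, steps: int = 2000) -> float:
    """P(X = n d) = (d / 2 pi) * integral_{-pi/d}^{pi/d} exp(-i u n d) cf(u) du (midpoint rule)."""
    a = math.pi / d
    h = 2.0 * a / steps
    total = 0.0 + 0.0j
    for j in range(steps):
        u = -a + (j + 0.5) * h
        total += cmath.exp(-1j * u * n * d) * cf(u)
    return (d / (2.0 * math.pi) * total * h).real


if __name__ == "__main__":
    lam = 2.0
    for n in range(6):
        exact = math.exp(-lam) * lam ** n / math.factorial(n)
        print(n, lattice_inversion(lambda u: poisson_cf(u, lam), n), exact)
```

Because the integrand is periodic and is integrated over a full period, the midpoint rule is accurate to near machine precision here, which is why so few nodes suffice.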


Problem 24. Let $X_1, X_2, \ldots$ be independent, identically distributed integer-
valued random variables. Show that their sums are recurrent iff

$$\lim_{r \uparrow 1} \int_{-\pi}^{+\pi} \frac{1}{1 - r f(u)}\, du = \infty,$$

where $f(u)$ is the common characteristic function of $X_1, X_2, \ldots$


11. MORE ON CHARACTERISTIC FUNCTIONS 

There are some technical results concerning characteristic functions which we 
will need later. These revolve around expansions, approximation, and similar
results.

Proposition 8.44. If $E|X|^k < \infty$, then the characteristic function of $X$ has the
expansion

$$f(u) = \sum_{j=0}^{k-1} \frac{(iu)^j}{j!} EX^j + \frac{(iu)^k}{k!} (EX^k + \delta(u)),$$

where $\delta(u)$ denotes a function of $u$, such that $\lim_{u \to 0} \delta(u) = 0$, and satisfying
$|\delta(u)| \le 3E|X|^k$ for all $u$.


Proof. Use the Taylor expansion with remainder on $\sin y$, $\cos y$ for $y$ real
to get

$$\cos y + i \sin y = \sum_{j=0}^{k-1} \frac{(iy)^j}{j!} + \frac{(iy)^k}{k!} (\cos \theta_1 y + i \sin \theta_2 y),$$

where $\theta_1$, $\theta_2$ are real numbers such that $|\theta_1| \le 1$, $|\theta_2| \le 1$. Thus

$$e^{iuX} = \sum_{j=0}^{k} \frac{(iuX)^j}{j!} + \frac{(iuX)^k}{k!} [\cos(\theta_1 uX) + i \sin(\theta_2 uX) - 1].$$

Now $\theta_1$, $\theta_2$ are random, but still $|\theta_1| \le 1$, $|\theta_2| \le 1$. Now

$$|X^k[\cos(\theta_1 uX) - 1 + i \sin(\theta_2 uX)]| \le 3|X|^k,$$

which establishes $|\delta(u)| \le 3E|X|^k$. Use the bounded convergence theorem
to get

$$\lim_{u \to 0} |\delta(u)| \le \lim_{u \to 0} E|X^k[\cos(\theta_1 uX) - 1 + i \sin(\theta_2 uX)]| = 0.$$

Another point that needs discussion is the logarithm of a complex
number. For $z$ complex, $\log z$ is a many-valued function defined by

$$z = e^{\log z}. \tag{8.45}$$

For any determination of $\log z$, $\log z + 2n\pi i$, $n = 0, \pm 1, \ldots$ is another
solution of (8.45). Write $z = re^{i\theta}$; then $\log z = \log r + i\theta$. We always
will pick that determination of $\theta$ which satisfies $-\pi < \theta \le \pi$, unless we
state otherwise. With this convention, $\log z$ is uniquely determined.


Proposition 8.46. For $z$ complex,

$$\log(1 + z) = z(1 + \epsilon(z)),$$

where $|\epsilon(z)| \le |z|$ for $|z| \le \tfrac{1}{2}$.


Proof. For $|z| < 1$, the power series expansion is

$$\log(1 + z) = z - \frac{z^2}{2} + \frac{z^3}{3} - \cdots = z\left(1 - \frac{z}{2} + \frac{z^2}{3} - \cdots\right).$$

For $|z| \le \tfrac{1}{2}$,

$$\left| \frac{z}{2} - \frac{z^2}{3} + \cdots \right| \le \frac{|z|}{2} (1 + |z| + |z|^2 + \cdots) = \frac{|z|}{2} \cdot \frac{1}{1 - |z|} \le |z|.$$

One remark: Given a sequence $f_n(u)$ of characteristic functions, frequently
we will take $l_n(u) = \log f_n(u)$, and show that $l_n(u) \to \varphi(u)$ for some evaluation
of the log function. Now $l_n(u)$ is not uniquely determined.

$$l_n'(u) = l_n(u) + 2\pi i N_n(u),$$

$N_n(u)$ integer-valued, is just as good a version of $\log f_n(u)$. However, if
$l_n(u) \to \varphi(u)$, and $\varphi(u)$ is continuous at the origin for one evaluation of $l_n(u)$,
then because $f_n(u) = e^{l_n(u)} \to e^{\varphi(u)}$ the continuity theorem is in force.
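
The bound in Proposition 8.46 is simple to probe numerically with the principal branch of the logarithm. The sketch below scans a polar grid in the disk $|z| \le \tfrac12$ and reports the largest observed value of $|\epsilon(z)|/|z|$; the grid resolution and function names are arbitrary choices for this illustration.

```python
import cmath
import math


def epsilon(z: complex) -> complex:
    """epsilon(z) defined by log(1 + z) = z * (1 + epsilon(z)), principal branch, z != 0."""
    return cmath.log(1.0 + z) / z - 1.0


def max_ratio(radius_steps: int = 50, angle_steps: int = 360) -> float:
    """Largest |epsilon(z)| / |z| over a polar grid covering 0 < |z| <= 1/2."""
    worst = 0.0
    for i in range(1, radius_steps + 1):
        r = 0.5 * i / radius_steps
        for j in range(angle_steps):
            z = cmath.rect(r, 2.0 * math.pi * j / angle_steps)
            worst = max(worst, abs(epsilon(z)) / abs(z))
    return worst


if __name__ == "__main__":
    print("max |epsilon(z)| / |z| on the grid:", max_ratio())
```

On this grid the maximum is attained at $z = -\tfrac12$, where the ratio equals $2(2\log 2 - 1) \approx 0.7726$, comfortably below the bound of 1.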


12. METHOD OF MOMENTS 


Suppose that all moments of a sequence of distribution functions $F_n$ exist
and for every integer $k \ge 0$, the limit of

$$\int x^k F_n(dx)$$

exists. Does it follow that there is a distribution $F$ such that $F_n \xrightarrow{\mathcal{D}} F$? Not
necessarily! The reason that the answer may be “No” is that the functions
$x^k$ do not separate. There are examples [123] of distinct distribution functions
$F$ and $G$ such that $\int |x|^k\, dF < \infty$, $\int |x|^k\, dG < \infty$ for all $k \ge 0$, and

$$\int x^k\, dF = \int x^k\, dG, \qquad k = 0, 1, \ldots$$

Start to argue this way: If $\limsup \int x^2\, dF_n < \infty$, then (Problem 10) the $\{F_n\}$
are mass-preserving. Take a subsequence $F_{n'} \xrightarrow{\mathcal{D}} F$. Then (Problem 14)

$$\lim_{n'} \int x^k\, dF_{n'} = \int x^k\, dF,$$

so for the full sequence

$$\lim_n \int x^k\, dF_n = \int x^k\, dF. \tag{8.47}$$

If there is only one $F$ such that (8.47) holds, then every convergent
subsequence of $F_n$ converges to $F$, hence $F_n \xrightarrow{\mathcal{D}} F$. Thus


Theorem 8.48. If there is at most one distribution function $F$ such that

$$\lim_n \int x^k\, dF_n = \int x^k\, dF, \qquad k = 0, 1, \ldots,$$

then $F_n \xrightarrow{\mathcal{D}} F$.


The question is now one of uniqueness. Let

$$\mu_k = \lim_n \int x^k\, dF_n.$$

If $F$ is uniquely determined by (8.47), then the moment problem given by
the $\mu_k$ is said to be determined. In general, if the $\mu_k$ do not grow too fast,
then uniqueness holds. A useful sufficient condition is


Proposition 8.49. If

$$\limsup_k \frac{|\mu_k|^{1/k}}{k} < \infty,$$

then there is at most one distribution function $F$ satisfying

$$\mu_k = \int x^k F(dx).$$

Proof. Let

$$r = \limsup_k \frac{|\mu_k|^{1/k}}{k};$$

then for any $\epsilon > 0$ and $k \ge k_0$, using the even moments to get bounds for
the odd moments,

$$\int |x|^k\, dF \le k^k (r + \epsilon)^k.$$

Hence, by the monotone convergence theorem,

$$\int e^{|\xi x|}\, dF = \lim_n \sum_0^n \frac{|\xi|^k}{k!} \int |x|^k\, dF < \infty,$$

for $|\xi| < 1/re$. Consider

$$\varphi(z) = \int e^{zx} F(dx).$$

By the above, $\varphi(z)$ is analytic in the strip $|\operatorname{Rl} z| < 1/re$. For $|z| < 1/re$,

$$\varphi(z) = \sum_0^{\infty} \frac{\mu_k z^k}{k!}. \tag{8.50}$$

This holds for any distribution function $F$ having moments $\mu_k$. Since $\varphi(z)$
in the strip is the analytic continuation of $\varphi(z)$ given by (8.50), then $\varphi(z)$ is
completely determined by $\mu_k$. But for $\operatorname{Rl} z = 0$, $\varphi(z)$ is the characteristic
function and thus uniquely determines $F$.
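
To get a feel for the growth condition in Proposition 8.49, the sketch below evaluates $|\mu_k|^{1/k}/k$ for two standard moment sequences: the $\mathcal{N}(0, 1)$ moments ($\mu_k = (k-1)!!$ for even $k$) and the lognormal moments ($\mu_k = e^{k^2/2}$ for $X = e^Z$ with $Z$ standard normal). These closed forms are standard facts; the function names are illustrative. Note that the condition is only sufficient, so its failure for the lognormal sequence does not by itself prove nonuniqueness, although the lognormal law is in fact a well-known example of an indeterminate moment problem.

```python
import math


def normal_moment(k: int) -> float:
    """E X^k for X distributed N(0, 1): zero for odd k, (k - 1)!! for even k."""
    if k % 2:
        return 0.0
    return float(math.prod(range(1, k, 2)))


def growth_normal(k: int) -> float:
    """|mu_k|^(1/k) / k for the standard normal moments."""
    return normal_moment(k) ** (1.0 / k) / k


def growth_lognormal(k: int) -> float:
    """|mu_k|^(1/k) / k for mu_k = exp(k^2 / 2), computed as exp(k / 2) / k to avoid overflow."""
    return math.exp(k / 2.0) / k


if __name__ == "__main__":
    for k in (2, 10, 50, 100):
        print(k, growth_normal(k), growth_lognormal(k))
```

The normal column decreases toward zero, so the hypothesis of 8.49 holds, while the lognormal column blows up.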


13. OTHER SEPARATING FUNCTION CLASSES 


For restricted classes of distribution functions, there are separating classes of
functions which are sometimes more useful than the complex exponentials.
For example, consider only nonnegative random variables; their distribution
functions assign zero mass to $(-\infty, 0)$. Call this class of distribution
functions $\mathcal{M}^+$.

Proposition 8.51. The exponentials $\{e^{-\lambda x}\}$, $\lambda$ real and nonnegative, separate
in $\mathcal{M}^+$.

Proof. Suppose $F$ and $G$ are in $\mathcal{M}^+$ and for all $\lambda \ge 0$,

$$\int e^{-\lambda x} F(dx) = \int e^{-\lambda x} G(dx).$$

Then substitute $e^{-x} = y$, so

$$\int_0^1 y^{\lambda}\, dF(-\log y) = \int_0^1 y^{\lambda}\, dG(-\log y). \tag{8.52}$$

In particular (8.52) holds for $\lambda$ ranging through the nonnegative integers.
Thus for any polynomial $P(y)$,

$$\int_0^1 P(y)\, dF(-\log y) = \int_0^1 P(y)\, dG(-\log y),$$

hence equality holds for any continuous function on $[0, 1]$. Use an
approximation argument to conclude now that $F = G$.

As before, if $F_n \in \mathcal{M}^+$ and $\int e^{-\lambda x} F_n(dx)$ converges for all $\lambda \ge 0$, then
there is at most one distribution function $F$ such that $F_n \xrightarrow{\mathcal{D}} F$. Let the
limit of $\int e^{-\lambda x}\, dF_n(x)$ be $h(\lambda)$. Then by the bounded convergence theorem,

$$\lim_n \int \left( \frac{1 - e^{-ux}}{ux} \right) dF_n(x) = \frac{1}{u} \int_0^u h(\lambda)\, d\lambda.$$

So conclude, just as in the continuity theorem, that if

$$\lim_{\lambda \to 0} h(\lambda) = 1,$$

then the sequence $\{F_n\}$ is mass-preserving. Hence there is a unique
distribution function $F$ such that $F_n \xrightarrow{\mathcal{D}} F$.

For $X$ taking on nonnegative integer values, the moment-generating
function is defined as

$$\varphi(z) = E z^X$$

for $z$ complex, $|z| \le 1$.


Problem 25. Prove that the functions $z^x$, $|z| \le 1$ are separating in the class
of distribution functions of nonnegative integer-valued random variables. If
$\{X_n\}$ are a set of such random variables and

$$\varphi_n(z) = E z^{X_n}$$

converges for all $|z| \le 1$ to a function continuous at $z = 1$, then show there
is a random variable $X$ such that $X_n \xrightarrow{\mathcal{D}} X$.

NOTES


More detailed background on distribution functions, etc., can be found in
Loéve’s book [108]. For material on the moment problem, consult Shohat
and Tamarkin [123]. For Laplace transforms of distributions $\int e^{-\lambda x}\, dF(x)$
see Widder [140]. Although the central limit theorem for coin-tossing was
proved early in the nineteenth century, a more general version was not
formulated and proved until 1901 by Lyapunov [109]. The interesting proof
we give in Section 6 is due to Lindeberg [106].

An important estimate for the rate of convergence in the central limit
theorem is due to Berry and Esseen (see Loéve, [108, pp. 282 ff.]). They prove
that there is a universal constant $c$ such that if $S_n = X_1 + \cdots + X_n$ is a
sum of independent, identically distributed random variables with $EX_1 = 0$,
$EX_1^2 = \sigma^2 < \infty$, $E|X_1|^3 < \infty$, and if $\Phi(x)$ is the distribution function of the
$\mathcal{N}(0, 1)$ law, then

$$\sup_x \left| P\left( \frac{S_n}{\sigma \sqrt{n}} < x \right) - \Phi(x) \right| \le c\, \frac{E|X_1|^3}{\sigma^3 \sqrt{n}}.$$

It is known that $c < 4$ (Le Cam [99]) and unpublished calculations give
bounds as low as 2.05. By considering coin-tossing, note that the $1/\sqrt{n}$ rate
of convergence cannot be improved upon.

CHAPTER 9 


THE ONE-DIMENSIONAL 
CENTRAL LIMIT PROBLEM 


1. INTRODUCTION 


We know already that if $X_1, X_2, \ldots$ are independent and identically
distributed, $EX_1 = 0$, $EX_1^2 = \sigma^2 < \infty$, then

$$\frac{X_1 + \cdots + X_n}{\sigma \sqrt{n}} \xrightarrow{\mathcal{D}} \mathcal{N}(0, 1).$$

Furthermore, by the convergence of types theorem, no matter how $S_n$ is
normalized, if $S_n/A_n \xrightarrow{\mathcal{D}}$ then the limit is a normal law or degenerate. So
this problem is pretty well solved, with the exception of the question: Why
is the normal law honored above all other laws? From here there are a number
of directions available; the identically distributed requirement can be
dropped. This leads again to a normal limit if some nice conditions on
moments are satisfied. Or the condition on moments can be dropped; take
$X_1, X_2, \ldots$ independent, identically distributed but $EX_1^2 = \infty$. Now a
new class of laws enters as the limits of $S_n/A_n$ for suitable $A_n$, the so-called
stable laws.

In a completely different direction is the law of rare events, convergence
to a Poisson distribution. But this result is allied to the central limit problem
and there is an elegant unification via the infinitely divisible laws. Throughout
this chapter, unless explicitly stated otherwise, equations involving logs of
characteristic functions are supposed to hold modulo additive multiples of
$2\pi i$.


2. WHY NORMAL? 


There is really no completely satisfying answer to this question. But consider,
if $X_1, X_2, \ldots$, are independent, identically distributed, and if

$$\frac{X_1 + \cdots + X_n}{\sqrt{n}} \xrightarrow{\mathcal{D}} X,$$

what are the properties that $X$ must have? Look at

$$Z_{2n} = \frac{X_1 + \cdots + X_n + X_{n+1} + \cdots + X_{2n}}{\sqrt{2n}}.$$

Now $Z_{2n} \xrightarrow{\mathcal{D}} X$. But

$$Z_{2n} = \frac{X_1 + \cdots + X_n}{\sqrt{2}\sqrt{n}} + \frac{X_{n+1} + \cdots + X_{2n}}{\sqrt{2}\sqrt{n}} = \frac{1}{\sqrt{2}} (Z_n' + Z_n'').$$

The variables $Z_n'$, $Z_n''$ are independent, and $Z_n' \xrightarrow{\mathcal{D}} X$, $Z_n'' \xrightarrow{\mathcal{D}} X$, since they
have the same distributions as $Z_n$. This (we would like to believe) implies that
$X$ has the same distribution as $(1/\sqrt{2})(X' + X'')$, where $X'$, $X''$ are independent
and have the same distribution as $X$. To verify this, note that

$$E e^{iuZ_{2n}} = (E e^{iuZ_n'/\sqrt{2}})(E e^{iuZ_n''/\sqrt{2}}),$$

or

$$f_{2n}(u) = f_n\left(\frac{u}{\sqrt{2}}\right) f_n\left(\frac{u}{\sqrt{2}}\right).$$

Since $f_n(u) \xrightarrow{uc} f(u)$, where $f(u)$ is the characteristic function of $X$, it follows
that $f(u) = f(u/\sqrt{2})^2$. But the right-hand side of this is the characteristic
function of $(X' + X'')/\sqrt{2}$. So our expectation is fulfilled. Now the point is:

Proposition 9.1. If a random variable $X$ satisfies $EX^2 < \infty$, and

$$X \stackrel{\mathcal{D}}{=} \frac{X' + X''}{\sqrt{2}},$$

where $X'$, $X''$ are independent and $\mathcal{L}(X) = \mathcal{L}(X') = \mathcal{L}(X'')$, then $X$ has a $\mathcal{N}(0, \sigma^2)$
distribution.

Proof. The proof is simple. Let $X_1, X_2, \ldots$ be independent, $\mathcal{L}(X_k) = \mathcal{L}(X)$.
$EX$ must be zero, since $EX = (EX' + EX'')/\sqrt{2}$ implies $EX = \sqrt{2}\, EX$.
By iteration,

$$X \stackrel{\mathcal{D}}{=} \frac{X_1 + \cdots + X_n}{\sqrt{n}}, \qquad n = 2^m.$$

But the right-hand sums, divided by $\sigma$, converge in distribution to $\mathcal{N}(0, 1)$.


Actually, this result holds without the restriction that $EX^2 < \infty$. A
direct proof of this is not difficult, but it also comes out of later work we will
do with stable laws, so we defer it.


3. THE NONIDENTICALLY DISTRIBUTED CASE 

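Let $X_1, X_2, \ldots$ be independent. Then

Theorem 9.2. Let $EX_k = 0$, $EX_k^2 = \sigma_k^2 < \infty$, $E|X_k|^3 < \infty$, and $s_n^2 = \sum_1^n \sigma_k^2$.
If

$$\lim_n \frac{1}{s_n^3} \sum_1^n E|X_k|^3 = 0, \tag{9.3}$$

then

$$\frac{X_1 + \cdots + X_n}{s_n} \xrightarrow{\mathcal{D}} \mathcal{N}(0, 1).$$

Proof. Very straightforward and humdrum, using characteristic functions.
Let $f_k$ be $f_{X_k}$, $g_n$ the characteristic function of $S_n/s_n$. Then

$$g_n(u) = \prod_1^n f_k\left(\frac{u}{s_n}\right).$$

Using the Taylor expansion, we get from (8.44)

$$f_k\left(\frac{u}{s_n}\right) = 1 - \frac{u^2}{2 s_n^2} \left( \sigma_k^2 + \delta_k\left(\frac{u}{s_n}\right) \right), \qquad \text{so that} \qquad \left| f_k\left(\frac{u}{s_n}\right) - 1 \right| \le 2u^2 \frac{\sigma_k^2}{s_n^2}.$$

Now $(E|X_k|^2)^{3/2} \le E|X_k|^3$, or $\sigma_k^3 \le E|X_k|^3$. Then condition 9.3 implies that
$\sup_{k \le n} \sigma_k/s_n \to 0$. So $\sup_{k \le n} |f_k(u/s_n) - 1| \to 0$ as $n$ goes to infinity.
Therefore use the log expansion

$$\log(1 + z) = z(1 + \theta z),$$

where $|\theta| \le 1$ for $|z| \le \tfrac{1}{2}$, to get

$$\log g_n(u) = \sum_1^n \log f_k\left(\frac{u}{s_n}\right) = \sum_1^n \left( f_k\left(\frac{u}{s_n}\right) - 1 \right) + \sum_1^n \theta_k \left( f_k\left(\frac{u}{s_n}\right) - 1 \right)^2,$$

where the equality holds modulo $2\pi i$. Consider the second term above,

$$
\begin{aligned}
\left| \sum_1^n \theta_k \left( f_k\left(\frac{u}{s_n}\right) - 1 \right)^2 \right|
&\le \sup_{k \le n} \left| f_k\left(\frac{u}{s_n}\right) - 1 \right| \sum_1^n \left| f_k\left(\frac{u}{s_n}\right) - 1 \right| \\
&\le 2u^2 \sup_{k \le n} \left| f_k\left(\frac{u}{s_n}\right) - 1 \right|.
\end{aligned}
$$

This bound goes to zero as $n \to \infty$. Apply the Taylor expansion,

$$f_k\left(\frac{u}{s_n}\right) - 1 = -\frac{u^2 \sigma_k^2}{2 s_n^2} + \theta_k \frac{u^3}{s_n^3} E|X_k|^3, \qquad |\theta_k| \le 1,$$

to the first term above to get

$$\sum_1^n \left( f_k\left(\frac{u}{s_n}\right) - 1 \right) = -\frac{u^2}{2} + \theta \frac{u^3}{s_n^3} \sum_1^n E|X_k|^3, \qquad |\theta| \le 1,$$

which converges to $-u^2/2$.

We conclude that

$$g_n(u) \to e^{-u^2/2}$$

for every $u$. Since the theorem holds for identically distributed random
variables, it follows that $e^{-u^2/2}$ must be the characteristic function of the $\mathcal{N}(0, 1)$
distribution. Apply the continuity theorem to complete the proof.


Note that we got, in this proof, the additional dividend that if $X$ is
$\mathcal{N}(0, 1)$, then

$$E e^{iuX} = e^{-u^2/2}.$$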

4. THE POISSON CONVERGENCE 


For X,, Xa, . . . independent and identically distributed, EX, = 0, EX? = o, 
let 
M, = max (TBs... Pal), 
Jn Jn 
Write, for x > 0, 
P(M, < x) = P(X) < Vax,» Xp} < vax) 


= [P(X < Vaxy. 


Now 


PIXI 2 Vinx) < $ Í yt dF(y) = + 0,0), 


x“ Jii Vne 


where lim 6,(x) = 0. This leads to 


P(M, < x) > (: pe 500) my D, 
n 


the point being that 
lim P(M, < x) = 1, all x, 


or M, 2, 0. In this case, therefore, we are dealing with sums 
Jn Jn 
of independent random variables such that the maximum of the individual 
summands converges to zero. I have gone through this to contrast it to the 
situation in which we have a sequence of coins, 1, 2, ..., with probabilities
of heads $p_1, p_2, \ldots$, where $p_n \to 0$, and the nth coin is tossed n times. For
the nth coin, let $X_k^{(n)}$ be one if heads comes on the kth trial, zero otherwise.
So

$$S_n = X_1^{(n)} + \cdots + X_n^{(n)}$$

is the number of heads gotten using the nth coin. Think! The probability
of heads on each individual trial is $p_n$ and that is going to zero. However,
the total number of trials is getting larger and larger. Is it possible that $S_n$
converges in distribution? Compute

$$P(S_n = 0) = (1 - p_n)^n.$$

This will converge if and only if $np_n \to \lambda$, $0 \le \lambda \le \infty$. If $\lambda = 0$, then
$P(S_n = 0) \to 1$, and henceforth rule this case out.

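For $\lambda > 0$, $P(M_n = 0) = P(S_n = 0) \to e^{-\lambda}$, so $M_n$ does not converge to zero in probability. Take characteristic functions, noting that

$$E e^{iuX_k^{(n)}} = p_n e^{iu} + q_n.$$

For n sufficiently large, these are close to one, for u fixed, and we can write

$$f_{S_n}(u) = (1 + p_n(e^{iu} - 1))^n,$$

$$\log f_{S_n}(u) = np_n(e^{iu} - 1) + \theta\, np_n^2 (e^{iu} - 1)^2, \quad |\theta| \le 1.$$

Since $p_n \to 0$, $np_n \to \lambda$, $np_n^2 \to 0$, this gives

$$\log f_{S_n}(u) \to \lambda(e^{iu} - 1),$$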


so 


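Theorem 9.4. $S_n$ converges in distribution if and only if $np_n \to \lambda$; then the limit has characteristic
function $\exp[\lambda(e^{iu} - 1)]$. The limit random variable X takes values in
$\{0, 1, 2, \ldots\}$, so

$$E e^{iuX} = \sum_{k=0}^{\infty} e^{iuk} P(X = k).$$

Expanding,

$$e^{\lambda(e^{iu} - 1)} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k e^{iuk}}{k!},$$

so


Definition 9.5. A random variable X taking values in {0, a, 2a, 3a, ...} will
be said to have Poisson distribution $\mathcal{P}(\lambda)$, $\lambda > 0$, with jump size a, if

$$P(X = ak) = \frac{\lambda^k}{k!}\, e^{-\lambda}, \quad \text{or} \quad f(u) = \exp[\lambda(e^{iau} - 1)].$$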
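The Poisson convergence above can be checked numerically. The following is a minimal sketch, assuming the nth coin has $p_n = \lambda/n$ exactly; the helper names (`binomial_pmf`, `poisson_pmf`, `max_pmf_gap`) are illustrative and not part of the text. It compares the exact law of $S_n$ (binomial) with the Poisson law of Definition 9.5 with jump size 1.

```python
import math


def binomial_pmf(n: int, p: float, k: int) -> float:
    """P(S_n = k) when S_n counts heads in n tosses with head probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(lam: float, k: int) -> float:
    """P(X = k) for the Poisson law with parameter lam and jump size 1."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_pmf_gap(n: int, lam: float, kmax: int = 20) -> float:
    """Largest pointwise gap between the law of S_n and the Poisson limit,
    under the assumption p_n = lam / n (so that n * p_n = lam for every n)."""
    p = lam / n
    return max(
        abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k)) for k in range(kmax + 1)
    )


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, max_pmf_gap(n, lam=2.0))
```

The gap shrinks as n grows, in line with $np_n^2 \to 0$ controlling the error term in the expansion of $\log f_{S_n}(u)$.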


Look now at the $X_1^{(n)}, \ldots, X_n^{(n)}$. Since $P(X_k^{(n)} = 0) = q_n \to 1$, usually
the $X_k^{(n)}$ are zero, but once in a while along comes a blip. Again, take $M_n =
\max(X_1^{(n)}, \ldots, X_n^{(n)})$. Now $M_n$ can only take the values 0 or 1, and $M_n \nrightarrow 0$ in probability
unless $\lambda = 0$. Here the contrast obviously is that $M_n$ must equal 1 with
positive probability, or $S_n \xrightarrow{\mathcal{D}} 0$. It is the difference between the sum of
uniformly small smears, versus the sum of occasionally large blips. That
this is pretty characteristic is emphasized by

Proposition 9.6. Let $S_n = X_1^{(n)} + \cdots + X_n^{(n)}$, where $X_1^{(n)}, \ldots, X_n^{(n)}$ are
independent and identically distributed. If $S_n \xrightarrow{\mathcal{D}} X$ then $M_n \xrightarrow{P} 0$ if and
only if X is normal.

Proof. Deferred until Section 7. 

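The Poisson convergence can be generalized enormously. For example,
suppose $S_n = X_1^{(n)} + \cdots + X_n^{(n)}$, the $X_k^{(n)}$ independent and identically
distributed with

$$X_k^{(n)} = \begin{cases} 0 & \text{with probability } q_n, \\ x_i & \text{with probability } p_i^{(n)},\ i = 1, \ldots, j, \end{cases}$$

and $S_n \xrightarrow{\mathcal{D}} X$. We could again show that this is possible only if $np_i^{(n)} \to \lambda_i$,
$0 \le \lambda_i < \infty$, and if so, then

$$f_X(u) = \exp\left(\sum_{i=1}^{j} \lambda_i (e^{iux_i} - 1)\right).$$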


Two interesting points are revealed in this result. 


First: The expected number of times that $X_k^{(n)} = x_i$ is $np_i^{(n)}$. So $\lambda_i$ is roughly
the expected number of times that one of the summands is $x_i$.


Second: Since

$$f_X(u) = \prod_{i=1}^{j} \exp\left(\lambda_i (e^{iux_i} - 1)\right),$$

X is distributed as $X_1 + \cdots + X_j$, where the $X_i$ are independent random
variables and $X_i$ has Poisson distribution $\mathcal{P}(\lambda_i)$ with jump size $x_i$. So the
jumps do not interact; each jump size $x_i$ contributes an independent
Poisson component.


5. THE INFINITELY DIVISIBLE LAWS 

To include both Poisson and $\mathcal{N}(0,1)$ convergence, ask the following
question: Let

(9.7) $S_n = X_1^{(n)} + \cdots + X_n^{(n)},$

where the $X_k^{(n)}$, k = 1, ..., n, are independent and identically distributed.


If $S_n \xrightarrow{\mathcal{D}} X$, what are the possible distributions of X?
$S_n$ is the sum of many independent components; heuristically X must
have this same property.

Definition 9.8. X will be said to have an infinitely divisible distribution if for
every n, there are independent and identically distributed random variables
$X_1^{(n)}, \ldots, X_n^{(n)}$ such that $\mathcal{L}(X) = \mathcal{L}(X_1^{(n)} + \cdots + X_n^{(n)})$.
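
As a quick illustration of Definition 9.8 (a worked check, not part of the original text), the two limit laws met so far are infinitely divisible, because their characteristic functions have explicit nth roots that are again characteristic functions of the same family. The mean m and variance $\sigma^2$ in the normal case are generic parameters introduced here for the example.

```latex
% Normal law with mean m and variance sigma^2:
% the nth root is the N(m/n, sigma^2/n) characteristic function.
\exp\!\left(ium - \tfrac{1}{2}\sigma^2 u^2\right)
  = \left[\exp\!\left(iu\,\tfrac{m}{n} - \tfrac{1}{2}\,\tfrac{\sigma^2}{n}\,u^2\right)\right]^n .

% Poisson law with parameter lambda and jump size a:
% the nth root is the Poisson(lambda/n) characteristic function with the same jump size.
\exp\!\left[\lambda\left(e^{iau} - 1\right)\right]
  = \left[\exp\!\left(\tfrac{\lambda}{n}\left(e^{iau} - 1\right)\right)\right]^n .
```
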
Proposition 9.9. A random variable X is a limit in distribution of sums of the 
type (9.7) if and only if it has an infinitely divisible distribution. 
Proof. If X has an infinitely divisible distribution, then by definition there
are sums $S_n$ of type (9.7) with distribution exactly equal to X. The other
way: Consider

$$S_{2n} = (X_1^{(2n)} + \cdots + X_n^{(2n)}) + (X_{n+1}^{(2n)} + \cdots + X_{2n}^{(2n)}) = Y_n + Y_n'.$$

The random variables $Y_n$ and $Y_n'$ are independent with the same distribution.
If $S_n \xrightarrow{\mathcal{D}} X$, the distributions of $Y_n$ are mass-preserving, because

$$(P(Y_n > y))^2 = P(Y_n > y,\ Y_n' > y) \le P(S_{2n} > 2y),$$

and similarly,

$$(P(Y_n < -y))^2 \le P(S_{2n} < -2y).$$

Take a subsequence {n'} such that $Y_{n'} \xrightarrow{\mathcal{D}} Y$. Obviously, $f_X(u) = [f_Y(u)]^2$;
so $\mathcal{L}(X) = \mathcal{L}(Y + Y')$, Y, Y' independent. This can be repeated to get X
equal in distribution to the sum $Y_1 + \cdots + Y_m$ by considering $S_{nm}$.


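If $S_n \xrightarrow{\mathcal{D}} X$, do the components $X_1^{(n)}, \ldots, X_n^{(n)}$ have to get smaller and
smaller in any reasonably formulated way? Note that in both the Poisson
and $\mathcal{N}(0, 1)$ convergence, for any $\epsilon > 0$, $P(|X_1^{(n)}| > \epsilon) \to 0$; that is, $X_1^{(n)} \xrightarrow{P} 0$
[so, of course, $\sup_k P(|X_k^{(n)}| > \epsilon) \to 0$, since these probabilities are the same
for all k = 1, ..., n]. This holds in general.


Proposition 9.10. If $S_n \xrightarrow{\mathcal{D}} X$, then $X_1^{(n)} \xrightarrow{P} 0$.


Proof. Since $f_{S_n}(u) \xrightarrow{uc} f_X(u)$, there is a neighborhood N of the origin such
that $\operatorname{Rl} f_{S_n}(u) > \delta > 0$, $u \in N$, all $n > 1$. On N, $|\operatorname{Arg} f_{S_n}(u)|$ is bounded
away from $\pi$. Let $f_n(u) = f_{X_1^{(n)}}(u)$; so

$$f_{S_n}(u) = [f_n(u)]^n.$$

On N, then

$$|f_{S_n}(u)| = |f_n(u)|^n, \qquad \operatorname{Arg} f_{S_n}(u) = n \operatorname{Arg} f_n(u).$$

So $f_n(u) \to 1$, for $u \in N$, and now apply 8.37.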


Now I turn to the problem of characterizing the infinitely divisible distributions. Let f(u) be the characteristic function of X. Therefore, since
$\mathcal{L}(X) = \mathcal{L}(X_1^{(n)} + \cdots + X_n^{(n)})$, there is a characteristic function $f_n(u)$ such
that $f(u) = [f_n(u)]^n$, and by 9.10, $f_n(u) \to 1$. Then,

$$\log f(u) = n \log[1 - (1 - f_n(u))]$$
(9.11) $= n(f_n(u) - 1)(1 + \epsilon_n(u)).$


Since $f_n(u) \xrightarrow{uc} 1$, it follows that $\epsilon_n(u) \xrightarrow{uc} 0$. Also, $|f(u)| > 0$, all u, otherwise
$f(u_0) = 0$ implies $f_n(u_0) = 0$, all n, contradicting $f_n(u) \xrightarrow{uc} 1$. Denote by
$F_n$ the distribution function of $X_1^{(n)}$; then


(9.12) $\log f(u) = (1 + \epsilon_n(u)) \int (e^{iux} - 1)\, n\, dF_n.$
If we set up approximating sums of the integral in (9.12), we get

$$f(u) \simeq \exp\left[\sum_i \lambda_i (e^{iux_i} - 1)\right],$$

exactly like the general Poisson case looked at before. Note also that if we
put $\mu_n(B) = nF_n(B)$, $\mu_n$ a nonnegative measure on $\mathcal{B}_1$, then

$$\mu_n(B) = E(\text{number of } X_k^{(n)} \in B).$$

Since $\log f(u) = (1 + \epsilon_n) \int (e^{iux} - 1)\mu_n(dx)$, if $\mu_n$ converges to a measure
$\mu$ such that for continuous bounded functions $\varphi(x)$, $\int \varphi(x)\mu_n(dx) \to \int \varphi(x)\mu(dx)$, then we could conclude

$$f(u) = \exp\left[\int (e^{iux} - 1)\,\mu(dx)\right].$$


This is the basic idea, but there are two related problems. First, the total
mass of $\mu_n$ is $\mu_n(R^{(1)}) = n$, hence $\mu_n(R^{(1)}) \to \infty$. Certainly, then, for $\varphi(x) \equiv 1$
there is no finite $\mu$ such that $\int \varphi\, d\mu_n \to \int \varphi\, d\mu$. Second, how can the $\mathcal{N}(0, 1)$
characteristic function $e^{-u^2/2}$ be represented as above? Now, for any neighborhood N of the origin, we would expect more and more of the $X_k^{(n)}$ to
be in N; that is, $\mu_n(N) \to \infty$. But in analogy with Poisson convergence, the
number of times that $X_k^{(n)}$ is sizeable enough to take values outside of N
should be bounded; that is, $\overline{\lim}\, \mu_n(N^c) < \infty$. We can prove even more than
this.


Proposition 9.13

$$\overline{\lim_n}\ \mu_n[-a, +a]^c \le \kappa a \int_0^{1/a} |\operatorname{Rl} \log f(v)|\, dv.$$

Proof. By inequality 8.29,

$$\mu_n[-a, +a]^c \le \kappa a n \int_0^{1/a} (1 - \operatorname{Rl} f_n(v))\, dv.$$

Take the real part of Eq. (9.11), and pass to the limit to get

$$\lim_n n(1 - \operatorname{Rl} f_n(u)) = -\operatorname{Rl} \log f(u).$$

Use $|f(u)| > 0$, all u, and the bounded convergence theorem for the rest.


Since $\operatorname{Rl} \log f(0) = 0$,

$$\lim_{a \to \infty} \overline{\lim_n}\ \mu_n[-a, +a]^c = 0,$$

so the $\mu_n$ sequence is in this sense mass-preserving. What is happening is
that the mass of $\mu_n$ is accumulating near the origin, and behaving nicely
away from the origin as $n \to \infty$. But if then $\varphi(0) = 0$, there is some hope
that $\int \varphi(x)\mu_n(dx)$ may converge. This is true to some extent, more exactly,


Proposition 9.14

$$\overline{\lim_n} \int_{[-1,+1]} x^2 \mu_n(dx) < \infty.$$

Proof.

$$n(1 - \operatorname{Rl} f_n(1)) = \int (1 - \cos x)\,\mu_n(dx) \ge \int_{[-1,+1]} (1 - \cos x)\,\mu_n(dx).$$

By Taylor's expansion, $\cos x = 1 - (x^2/2)\cos(\alpha x)$, $|\alpha| \le 1$, so there is a $\beta$,
$0 < \beta < \infty$, such that $\cos x \le 1 - \beta x^2$, for $|x| \le 1$. Thus

$$n(1 - \operatorname{Rl} f_n(1)) \ge \beta \int_{[-1,+1]} x^2 \mu_n(dx).$$

However, $n(1 - \operatorname{Rl} f_n(1)) \to -\operatorname{Rl} \log f(1)$, giving the result, since $|f(1)| \ne 0$.

By 9.13 and 9.14, if we define $\nu_n(B) = \int_B \varphi(x)\mu_n(dx)$, where $\varphi(x)$ is
bounded and behaves like $x^2$ near the origin, the $\nu_n(B)$ is a bounded sequence
of measures and we can think of trying to apply the Helly-Bray theorem.
The choice of $\varphi(x)$ is arbitrary, subject only to boundedness and the right
behavior near zero. The time honored custom is to take $\varphi(x)$ to be $x^2/(1 + x^2)$.
Thus let $\alpha_n = \int (y^2/(1 + y^2))\,\mu_n(dy)$,

$$G_n(x) = \frac{1}{\alpha_n} \int_{-\infty}^{x} \frac{y^2}{1 + y^2}\,\mu_n(dy),$$

making $G_n(x)$ a distribution function. By 9.13 and 9.14, $\overline{\lim}\, \alpha_n < \infty$. We
can write

$$\log f(u) = (1 + \epsilon_n)\,\alpha_n \int (e^{iux} - 1)\,\frac{1 + x^2}{x^2}\, G_n(dx),$$

but the integrand blows up as $x \to 0$. So we first subtract the infinity by
writing,

$$\log f(u) = (1 + \epsilon_n) \int \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) d\mu_n + (1 + \epsilon_n) \int \frac{iux}{1 + x^2}\, d\mu_n.$$

Then we write $\beta_n = \int x/(1 + x^2)\, d\mu_n$, so that

(9.15) $\log f(u) = (1 + \epsilon_n)\,\alpha_n \int \left(e^{iux} - 1 - \dfrac{iux}{1 + x^2}\right) \dfrac{1 + x^2}{x^2}\, G_n(dx) + i(1 + \epsilon_n)\beta_n u.$


If the integral term converges we see that $\{\beta_n\}$ can contain no subsequence
going to infinity. If it did, then

$$\exp[i\beta_n u(1 + \epsilon_n(u))] \xrightarrow{uc} h(u)$$

would imply that on substituting $u = v/\beta_n$, and going along the subsequence,
we would get $e^{iv} = h(0)$ for all v. If $\{\beta_n\}$ has two limit points $\beta$, $\beta'$, then
$e^{i\beta u} = e^{i\beta' u}$, hence $\beta = \beta'$. Thus, uc convergence of the first term entails
convergence of the second term to $i\beta u$.

The integrand in (9.15)

$$\varphi(x, u) = \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) \frac{1 + x^2}{x^2}$$

is a continuous bounded function of x for $x \ne 0$. As $x \to 0$, it has the limit
$-u^2/2$. By defining $\varphi(0, u) = -u^2/2$, $\varphi(x, u)$ is jointly continuous everywhere.
By 9.13, $\{G_n\}$ is mass-preserving. If $\underline{\lim}\, \alpha_n = 0$, take n' such that $\alpha_{n'} \to 0$
and conclude from (9.15) that X is degenerate. Otherwise, take n' such that
$\alpha_{n'} \to \alpha > 0$ and $G_{n'} \xrightarrow{\mathcal{D}} G$. Then G is a distribution function. Go along
the n' sequence in 9.15. The $\beta_{n'}$ sequence must converge to some limit $\beta$
since the integral term converges uc. Therefore

$$\log f(u) = \alpha \int \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) \frac{1 + x^2}{x^2}\, G(dx) + i\beta u.$$

Suppose $G(\{0\}) > 0$, then

(9.16) $\log f(u) = i\beta u - \dfrac{\alpha G(\{0\})}{2}\, u^2 + \alpha \int_{\{0\}^c} \varphi(x, u)\, G(dx).$


We have now shown part of 


Theorem 9.17. X has infinitely divisible distribution if and only if its characteristic function f(u) is given by

(9.18) $\log f(u) = i\beta u - \dfrac{\sigma^2 u^2}{2} + \int \left(e^{iux} - 1 - \dfrac{iux}{1 + x^2}\right) \dfrac{1 + x^2}{x^2}\, \nu(dx),$

where $\nu$ is a finite measure that assigns zero mass to the origin.
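
As a sanity check on the canonical form (9.18), the sketch below evaluates its right-hand side when $\nu$ is a finite sum of point masses and confirms that a Poisson law with parameter $\lambda$ and jump size a fits the form. The function names and the particular choice $\beta = \lambda a/(1 + a^2)$, $\nu(\{a\}) = \lambda a^2/(1 + a^2)$ are worked out here for illustration and are not taken from the text.

```python
import cmath


def phi(x: float, u: float) -> complex:
    """Integrand of (9.18) for x != 0:
    (e^{iux} - 1 - iux/(1+x^2)) * (1+x^2)/x^2."""
    return (cmath.exp(1j * u * x) - 1 - 1j * u * x / (1 + x * x)) * (1 + x * x) / (x * x)


def log_cf_canonical(u: float, beta: float, sigma2: float, atoms) -> complex:
    """Right-hand side of (9.18) when nu is a finite sum of point masses.
    `atoms` is a list of (x, mass) pairs with x != 0 (an assumption of this sketch)."""
    integral = sum(mass * phi(x, u) for x, mass in atoms)
    return 1j * beta * u - 0.5 * sigma2 * u * u + integral


def poisson_canonical_triple(lam: float, a: float):
    """Parameters (beta, sigma^2, atoms) that represent Poisson(lam) with jump a."""
    beta = lam * a / (1 + a * a)
    atoms = [(a, lam * a * a / (1 + a * a))]
    return beta, 0.0, atoms


if __name__ == "__main__":
    lam, a = 1.7, 0.8
    beta, sigma2, atoms = poisson_canonical_triple(lam, a)
    for u in (-3.0, 0.5, 2.0):
        direct = lam * (cmath.exp(1j * a * u) - 1)
        print(u, abs(log_cf_canonical(u, beta, sigma2, atoms) - direct))
```

The compensating term $iux/(1 + x^2)$ inside the integral is cancelled exactly by the drift $\beta$, which is why the Poisson exponent $\lambda(e^{iau} - 1)$ is recovered.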


To complete the proof: It has to be shown that any random variable whose
characteristic function is of the form (9.18) has infinitely divisible
distribution. To begin, assume that any function f(u) whose log is of the
form (9.18) is a characteristic function. Then it is trivial because if $f_n(u)$ is
defined by

$$\log f_n(u) = \frac{1}{n} \log f(u),$$

then $\log f_n(u)$ is again of the form (9.18); so $f_n(u)$ is a characteristic function.
Since f(u) now is given by $[f_n(u)]^n$ for any n, the corresponding distribution is
infinitely divisible. The last point is to show that (9.18) always gives a characteristic function. Take partitions $\mathcal{P}_n$ of $R^{(1)}$ into finite numbers of intervals
such that the Riemann sums in (9.18) converge to the integral, that is,

$$\sum_i \left(e^{iux_i} - 1 - \frac{iux_i}{1 + x_i^2}\right) \frac{1 + x_i^2}{x_i^2}\, \nu(I_i) \to \int \varphi(x, u)\,\nu(dx).$$

Put $\beta_n = \beta - \sum_i (\nu(I_i)/x_i)$, denote

$$\lambda_i = \frac{1 + x_i^2}{x_i^2}\, \nu(I_i),$$

and write

$$g_n(u) = \left(\exp\left[i\beta_n u - (\sigma^2/2)u^2\right]\right) \cdot \left(\exp\left[\sum_i \lambda_i (e^{iux_i} - 1)\right]\right).$$

See that $g_n(u)$ is the product of a characteristic function of a $\mathcal{N}(\beta_n, \sigma^2)$ distribution and characteristic functions of Poisson distributions $\mathcal{P}(\lambda_i)$ with
jump $x_i$. Therefore by Corollary 8.35, $g_n(u)$ is a characteristic function. This
does it, because

$$\lim_n g_n(u) = f(u)$$

for every u. Check that anything of the form (9.18) is continuous at
u = 0. Certainly the first two terms are. As to the integral, note that
$\sup_x |\varphi(x, u)| \le M$ for all $|u| \le 1$. Also, $\lim_{u \to 0} \varphi(x, u) = 0$ for every x, and
apply the bounded convergence theorem to get

$$\lim_{u \to 0} \int \varphi(x, u)\,\nu(dx) = 0, \quad \text{so} \quad \lim_{u \to 0} f(u) = 1.$$

By the continuity theorem, f(u) is a characteristic function. Q.E.D.


6. THE GENERALIZED LIMIT PROBLEM 


Just as before, it becomes reasonable to ask what are the possible limit laws
of

$$S_n = X_1^{(n)} + \cdots + X_n^{(n)}$$

if the restriction that the $X_1^{(n)}, \ldots, X_n^{(n)}$ be identically distributed is lifted.
Some restriction is needed; otherwise, take

$$X_2^{(n)} = \cdots = X_n^{(n)} = 0, \qquad X_1^{(n)} = X,$$


to get any limit distribution desired. What is violated in the spirit of the 
previous work is the idea of a sum of a large number of components, each one 
small on the average. That is, we had, in the identically distributed case, that 
for every $\epsilon > 0$,


(A) $\sup_k P(|X_k^{(n)}| > \epsilon) \to 0$, as $n \to \infty$.


This is the restriction that we retain in lieu of identical distribution. It is 
just about the weakest condition that can be imposed on the summands in 
order to prevent one of the components from exerting a dominant influence on 
the sum. With condition (A) a surprising result comes up. 


Theorem 9.19. If the sums $S_n \xrightarrow{\mathcal{D}} X$, then X has infinitely divisible distribution.


So in a strong sense the infinitely divisible laws are the limit laws of large
sums of independent components, each one small on the average. The
proof of 9.19 proceeds in exactly the same way as that of Theorem 9.17,
the only difference being that $\mu_n(B) = \sum_{k=1}^{n} F_k^{(n)}(B)$ instead of $nF_n(B)$, but
the same inequalities are used. It is the same proof except that one more
subscript is floating around.


Problem 1. Let X have infinitely divisible distribution,

$$\log f(u) = i\beta u + \int \varphi(x, u)\,\nu(dx).$$

If $\nu(\{0\}) = 0$, and if $\nu$ assigns all its mass to a countable set of points, prove
that the distribution of X is of pure type. [Use the law of pure types.]


7. UNIQUENESS OF REPRESENTATION AND CONVERGENCE


Let X have an infinitely divisible distribution with characteristic function
f(u). Then by (9.18), there is a finite measure $\gamma(dx)$ (possibly with mass at
the origin) and a constant $\beta$ such that


(9.20) $\log f(u) = i\beta u + \int \varphi(x, u)\,\gamma(dx) \quad [2\pi i],$


and $\varphi(x, u)$ is continuous in both x and u and bounded for $x \in R^{(1)}$, $u \in
[-U, +U]$, $U < \infty$. Log f(u) is defined up to additive multiples of $2\pi i$.
Because $|f(u)| \ne 0$, there is a unique version of log f(u) which is zero when
u is zero and is a continuous function of u on $R^{(1)}$. Now (9.20) states that this
version is given by the right-hand side above.

Proposition 9.21. $\gamma(dx)$ and $\beta$ are uniquely determined by (9.20).


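Proof. Let $\psi(u) = \log f(u)$; this is the continuous version of log f(u).
Then (following Loève [108]), take

$$\theta(u) = \psi(u) - \int_0^1 \frac{\psi(u + h) + \psi(u - h)}{2}\, dh$$

so that $\theta(u)$ is determined by $\psi$, hence by f. Note that

$$\varphi(x, u) - \int_0^1 \frac{\varphi(x, u + h) + \varphi(x, u - h)}{2}\, dh
= e^{iux} \int_0^1 (1 - \cos(hx))\, dh \cdot \frac{1 + x^2}{x^2}
= e^{iux} \left(1 - \frac{\sin x}{x}\right) \frac{1 + x^2}{x^2}.$$

Hence, using (9.20),

$$\theta(u) = \int e^{iux} g(x)\,\gamma(dx),$$

where

$$g(x) = \left(1 - \frac{\sin x}{x}\right) \frac{1 + x^2}{x^2}.$$

It is easy to check that $0 < \inf g(x) \le \sup g(x) < \infty$. But $\theta(u)$ uniquely
determines the measure $\nu(B) = \int_B g(x)\,\gamma(dx)$, and thus $\gamma$ is determined as
$\gamma(B) = \int_B [g(x)]^{-1}\,\nu(dx)$. If, therefore,

$$\psi(u) = i\beta' u + \int \varphi(x, u)\,\gamma'(dx),$$

then $\gamma = \gamma'$ implying $\beta = \beta'$.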
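
The claim that g(x) is bounded away from zero and infinity is left to the reader; a quick numerical look supports it. The sketch below assumes the weight is $g(x) = (1 - \sin x / x)(1 + x^2)/x^2$, as reconstructed from the displayed computation, and uses a short series near the origin to avoid cancellation. The grid and the helper name `weight_bounds` are illustrative choices.

```python
import math


def g(x: float) -> float:
    """Weight g(x) = (1 - sin(x)/x) * (1 + x^2) / x^2, extended by continuity at 0."""
    if abs(x) < 1e-4:
        # (1 - sin(x)/x) / x^2 = 1/6 - x^2/120 + O(x^4)
        return (1.0 / 6.0 - x * x / 120.0) * (1.0 + x * x)
    return (1.0 - math.sin(x) / x) * (1.0 + x * x) / (x * x)


def weight_bounds(limit: float = 50.0, steps: int = 200_001):
    """Minimum and maximum of g on an evenly spaced grid over [-limit, limit]."""
    values = [g(-limit + 2.0 * limit * k / (steps - 1)) for k in range(steps)]
    return min(values), max(values)


if __name__ == "__main__":
    print(weight_bounds())
```

On this grid the minimum sits at the origin, where g equals 1/6, and g tends to 1 as $|x| \to \infty$, so dividing by g in $\gamma(B) = \int_B [g(x)]^{-1}\nu(dx)$ is harmless.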
The fact that $\gamma(dx)$ is unique gives us a handhold on conditions for
$S_n \xrightarrow{\mathcal{D}} X$. Let $\gamma(dx) = \alpha G(dx)$ where G is a distribution function. Recall
that $\alpha$, G(x) were determined by taking any convergent subsequence $\alpha_{n'}$
of $\alpha_n$, $\alpha = \lim_{n'} \alpha_{n'}$, and taking G(x) as any limit distribution of the
$G_{n'}(x)$ sequence. Since $\alpha$, G(x) are unique, then $\alpha_n \to \alpha$, $G_n \xrightarrow{\mathcal{D}} G$. Consequently $\beta_n$, defined by

$$\beta_n = \int \frac{x}{1 + x^2}\, d\mu_n,$$

converges to $\beta$. Thus, letting $\gamma(x) = \gamma(-\infty, x)$, and

$$\gamma_n(x) = \int_{-\infty}^{x} \frac{y^2}{1 + y^2}\, \mu_n(dy),$$

then $S_n \xrightarrow{\mathcal{D}} X$ implies $\gamma_n \xrightarrow{\mathcal{D}} \gamma$, and $\{\gamma_n\}$ is mass-preserving in the sense that
for any $\epsilon > 0$, there exists a finite interval I such that $\sup_n \gamma_n(I^c) < \epsilon$. These
conditions are also sufficient:


Theorem 9.22. $S_n \xrightarrow{\mathcal{D}} X$ where X has characteristic function given by

$$\log f(u) = iu\beta + \int \varphi(x, u)\,\gamma(dx)$$

for $\gamma(dx)$ a finite measure if and only if the measures $\gamma_n(dx)$ are mass-preserving
in the above sense and

i) $\gamma_n \xrightarrow{\mathcal{D}} \gamma$, ii) $\beta_n \to \beta$.
Proof. All that is left to do is show sufficiency. This is easy. Since

$$n \int \frac{y^2}{1 + y^2}\, F_n(dy) = \gamma_n(+\infty) \le M < \infty,$$

it follows that

$$\int \frac{y^2}{1 + y^2}\, F_n(dy) \to 0;$$

hence $F_n$ converges to the law degenerate at zero. Thus for all u in a finite
interval, we can write

$$\log f_{S_n}(u) = (1 + \epsilon_n(u))\left[i\beta_n u + \int \varphi(x, u)\,\gamma_n(dx)\right],$$

where $\epsilon_n(u) \to 0$. Thus

$$\log f_{S_n}(u) \to i\beta u + \int \varphi(x, u)\,\gamma(dx).$$


Now we can go back and get the proof of 9.6. If $S_n = X_1^{(n)} + \cdots + X_n^{(n)}$,
and $S_n \xrightarrow{\mathcal{D}} X$, then clearly X is normal if and only if $\gamma_n$ converges to a
measure $\gamma$ concentrated on the origin. Equivalently, for every x > 0,
$\gamma_n((-x, +x)^c) \to 0$. Since

$$\gamma_n((-x, +x)^c) = \int_{(-x,+x)^c} \frac{y^2}{1 + y^2}\, nF_n(dy),$$

this is equivalent to

$$nF_n((-x, +x)^c) \to 0, \quad x > 0, \qquad \text{or} \qquad nP(|X_1^{(n)}| \ge x) \to 0.$$

But

$$P(M_n < x) = P(|X_1^{(n)}| < x, \ldots, |X_n^{(n)}| < x) = [1 - P(|X_1^{(n)}| \ge x)]^n.$$

Because $X_1^{(n)} \xrightarrow{P} 0$,

$$\log P(M_n < x) = -n(1 + \delta_n(x))\,P(|X_1^{(n)}| \ge x),$$

where $\delta_n(x) \to 0$ for x > 0. Therefore,

$$M_n \xrightarrow{P} 0 \iff nP(|X_1^{(n)}| \ge x) \to 0, \quad x > 0,$$

which completes the proof.
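
The criterion $nP(|X_1^{(n)}| \ge x) \to 0$ separates the two examples of Section 4 cleanly. The sketch below is an illustration under two stated assumptions: in the normal case the summands are $Z/\sqrt n$ with Z standard normal, and in the Poisson case each summand is 1 with probability $\lambda/n$ and 0 otherwise. The function names are hypothetical.

```python
import math


def normal_tail_mass(n: int, x: float) -> float:
    """n * P(|Z| / sqrt(n) >= x) for Z standard normal (assumed summand Z/sqrt(n))."""
    return n * math.erfc(math.sqrt(n) * x / math.sqrt(2.0))


def bernoulli_tail_mass(n: int, lam: float, x: float) -> float:
    """n * P(X >= x) when X is 1 with probability lam/n and 0 otherwise, 0 < x <= 1."""
    return n * (lam / n) if 0.0 < x <= 1.0 else 0.0


def prob_max_below(single_tail: float, n: int) -> float:
    """P(M_n < x) = [1 - P(|X_1| >= x)]^n for i.i.d. summands."""
    return (1.0 - single_tail) ** n


if __name__ == "__main__":
    x, lam = 0.5, 2.0
    for n in (10, 100, 1000):
        normal_mass = normal_tail_mass(n, x)
        blip_mass = bernoulli_tail_mass(n, lam, x)
        print(n, normal_mass, prob_max_below(normal_mass / n, n),
              blip_mass, prob_max_below(blip_mass / n, n))
```

In the normal case the tail mass vanishes and $P(M_n < x) \to 1$; in the Poisson case the tail mass stays at $\lambda$ and $P(M_n < x)$ settles near $e^{-\lambda}$, so the maximum does not go to zero.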


8. THE STABLE LAWS


Let $X_1, X_2, \ldots$ be identically distributed, nondegenerate, independent random
variables. What is the class of all possible limit laws of normed sums

(9.23) $S_n = \dfrac{X_1 + \cdots + X_n}{A_n} - B_n, \quad A_n > 0?$

Since $S_n$ may be written as

$$S_n = X_1^{(n)} + \cdots + X_n^{(n)},$$

where

$$X_k^{(n)} = \frac{X_k}{A_n} - \frac{B_n}{n},$$

the requirement $S_n \xrightarrow{\mathcal{D}} X$ implies that X is infinitely divisible. The condition
$X_1^{(n)} \xrightarrow{P} 0$ implies $A_n \to \infty$, $B_n/n \to 0$. This class of limit laws is the most
interesting set of distributions following the normal and Poisson. Of course,
if $EX_1^2 < \infty$, then $A_n \sim \sqrt n$ and X must be normal. So the only interesting
case is $EX_1^2 = \infty$. Two important questions arise.

First: What is the form of the class of all limit distributions X such that
$S_n \xrightarrow{\mathcal{D}} X$?

Second: Find necessary and sufficient conditions on the common distribution function of $X_1, X_2, \ldots$ so that $S_n \xrightarrow{\mathcal{D}} X$.


These two questions lead to the stable laws and the domains of attraction 
of the stable laws. 


Definition 9.24. A random variable X is said to have a stable law if for every
integer k > 0, and $X_1, \ldots, X_k$ independent with the same distribution as X,
there are constants $a_k > 0$, $b_k$ such that

$$\mathcal{L}(X_1 + \cdots + X_k) = \mathcal{L}(a_k X + b_k).$$

This approach is similar to the way we intrinsically characterized the
normal law; by breaking $S_n$ up into k blocks, we concluded that the limit of
$S_n/\sqrt n$ must satisfy $\mathcal{L}(X_1 + \cdots + X_k) = \mathcal{L}(\sqrt k\, X)$.


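Proposition 9.25. X is the limit in distribution of normed sums (9.23) if and
only if X has a stable law.

Proof. One way is quick: If X is stable, then $\mathcal{L}(X_1 + \cdots + X_n) =
\mathcal{L}(a_n X + b_n)$. Then (check this by characteristic functions),

$$\mathcal{L}(X) = \mathcal{L}\left(\frac{X_1 + \cdots + X_n}{a_n} - \frac{b_n}{a_n}\right),$$

and we can take $A_n = a_n$, $B_n = b_n/a_n$, to get

$$\frac{X_1 + \cdots + X_n}{A_n} - B_n \xrightarrow{\mathcal{D}} X$$

(actually equality in distribution with X is true here).
To go the other way, suppose

$$Z_n = \frac{S_n}{A_n} - B_n \xrightarrow{\mathcal{D}} X.$$

Then $Z_{nk} \xrightarrow{\mathcal{D}} X$ as $n \to \infty$ for all k. Repeat the trick we used for the normal
law:

$$Z_{nk} = \frac{S_n^{(1)} + \cdots + S_n^{(k)}}{A_{nk}} - B_{nk},$$

where $S_n^{(1)} = X_1 + \cdots + X_n$, $S_n^{(2)} = X_{n+1} + \cdots + X_{2n}, \ldots$. Thus

$$\left(\frac{S_n^{(1)}}{A_n} - B_n\right) + \cdots + \left(\frac{S_n^{(k)}}{A_n} - B_n\right) = \frac{A_{nk}}{A_n}\, Z_{nk} + C_{n,k},$$

where $C_{n,k} = (A_{nk}/A_n)B_{nk} - kB_n$. By the law of convergence of types,

$$\frac{A_{nk}}{A_n} \to a_k, \qquad C_{n,k} \to b_k,$$

$$\mathcal{L}(X_1 + \cdots + X_k) = \mathcal{L}(a_k X + b_k).$$

This not only proves 9.25 but contains the additional information that if
$Z_n \xrightarrow{\mathcal{D}} X$, then $A_{nk}/A_n \to a_k$ for all k. By considering the limit of $A_{nmk}/A_n$
as $n \to \infty$, we conclude that the constants $a_k$ must satisfy

(9.26) $a_{mk} = a_m a_k$, all $k, m \ge 1$.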


Therefore, 


9. THE FORM OF THE STABLE LAWS 


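Theorem 9.27. Let X have a stable law. Then either X has a normal distribution
or there is a number $\alpha$, $0 < \alpha < 2$, called the exponent of the law, and constants
$m_1 \ge 0$, $m_2 \ge 0$, $\beta$ such that

$$\log f_X(u) = iu\beta + m_1 \int_0^{\infty} \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) \frac{dx}{x^{1+\alpha}}
+ m_2 \int_{-\infty}^{0} \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) \frac{dx}{|x|^{1+\alpha}}.$$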
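Proof. Since X is stable, it is infinitely divisible,

$$\psi(u) = \log f_X(u) = iu\beta + \int \varphi(x, u)\,\gamma(dx).$$

In terms of characteristic function, the definition of stability becomes

$$[f_X(u)]^k = e^{ib_k u} f_X(a_k u)$$

or

(9.28) $k\psi(u) = ib_k u + \psi(a_k u).$

Separate the situation into two cases.

I. $\gamma(\{0\}) = 0$, II. $\gamma(\{0\}) > 0$.

CASE I. Define a measure $\mu$:

$$\mu(B) = \int_B \frac{1 + x^2}{x^2}\, \gamma(dx).$$

Then $\mu$ is $\sigma$-finite, $\mu[-a, +a]^c < \infty$ for any a > 0, $\int_{[-a,+a]} x^2\, d\mu < \infty$, and

$$\psi(u) = iu\beta + \int \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) \mu(dx),$$

$$\psi(a_k u) = ia_k\beta u + \int \left(e^{iua_k x} - 1 - \frac{ia_k ux}{1 + x^2}\right) \mu(dx)
= i\beta' u + \int \left(e^{iua_k x} - 1 - \frac{iua_k x}{1 + a_k^2 x^2}\right) \mu(dx),$$

where

$$\beta' = a_k\beta + a_k \int \left(\frac{x}{1 + a_k^2 x^2} - \frac{x}{1 + x^2}\right) \mu(dx).$$

This last integrand behaves like $x^3$ near the origin, and is bounded away
from the origin, so the integral exists. Define a change of variable measure
$\mu_k$ by

$$\mu_k(B) = \mu(x;\ a_k x \in B),$$

to get

$$\psi(a_k u) = i\beta' u + \int \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) \mu_k(dx).$$

Therefore, (9.28) becomes

$$iu\beta k + \int \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) k\mu(dx)
= i(b_k + \beta')u + \int \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) \mu_k(dx).$$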
By the uniqueness of representation of infinitely divisible characteristic
function we get the central result

(9.29) $k\mu = \mu_k.$

Let $M^+(x) = \mu[x, \infty)$, x > 0. From (9.29),

(9.30) $kM^+(x) = M^+(x/a_k)$, x > 0, k > 0.

Similarly, let $M^-(x) = \mu(-\infty, x]$, x < 0. Again, $kM^-(x) = M^-(x/a_k)$,
x < 0, k > 0.


Proposition 9.31. $a_k = k^{\lambda}$, $\lambda > 0$, and

$$M^+(x) = x^{-1/\lambda} M^+(1),$$
$$M^-(x) = |x|^{-1/\lambda} M^-(-1).$$

Proof. $M^+(x)$ is nonincreasing. The relation $kM^+(1) = M^+(1/a_k)$ implies
$a_k$ increasing in k, and we know $a_k \to \infty$. For any k, $a_{mk} = a_m a_k$ gives

$$a_{k^j} = (a_k)^j.$$

For n > k, take j such that $k^j \le n \le k^{j+1}$. Then

$$a_{k^j} \le a_n \le a_{k^{j+1}}$$

or

$$\log(a_k^j) \le \log a_n \le \log(a_k^{j+1}),$$

$$j \log a_k \le \log a_n \le (j + 1) \log a_k.$$

Dividing by $j \log k$, we get

$$\frac{\log a_k}{\log k} \le \frac{\log a_n}{\log n}\left[\frac{\log n}{j \log k}\right] \le \frac{j + 1}{j}\, \frac{\log a_k}{\log k}.$$

Now let $n \to \infty$; consequently $j \to \infty$, and $(\log n)/(j \log k) \to 1$, implying

$$\frac{\log a_k}{\log k} = \lim_n \frac{\log a_n}{\log n} = \lambda.$$

To do the other part, set $x = (k/n)^{\lambda}$; then in (9.30),

$$kM^+((k/n)^{\lambda}) = M^+(1/n^{\lambda}), \quad \text{all } k, n \ge 1.$$

For k = n, this is $nM^+(1) = M^+(1/n^{\lambda})$. Substituting this above gives

$$kM^+((k/n)^{\lambda}) = nM^+(1)$$

or

$$M^+((k/n)^{\lambda}) = M^+(1)/(k/n).$$

For all x in the dense set $\{(k/n)^{\lambda}\}$ we have shown $M^+(x) = x^{-1/\lambda} M^+(1)$. The
fact that $M^+(x)$ is nonincreasing makes this hold true for all x. Similarly
for $M^-(x)$.

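The condition $\int_{-1}^{+1} x^2\, d\mu < \infty$ implies $\int_{-1}^{+1} x^2 |x|^{-(1/\lambda)-1}\, dx < \infty$ so that
$-1/\lambda + 1 > -1$, or finally $\lambda > \tfrac12$. For $\psi(u)$ the expression becomes

$$\psi(u) = iu\beta + m_1 \int_0^{\infty} \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) \frac{dx}{x^{1+\alpha}}
+ m_2 \int_{-\infty}^{0} \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) \frac{dx}{|x|^{1+\alpha}},$$

where $m_1 = M^+(1) \cdot \frac{1}{\lambda}$, $m_2 = M^-(-1) \cdot \frac{1}{\lambda}$, and $\alpha = \frac{1}{\lambda}$, so $0 < \alpha < 2$.

CASE II. If $\gamma(\{0\}) = \sigma^2 > 0$, then

$$\psi(u) = iu\beta - \frac{\sigma^2}{2}\, u^2 + \int_{\{0\}^c} \varphi(x, u)\,\gamma(dx).$$

The coefficient $\sigma^2$ is uniquely determined by $\lim_{u \to \infty} \psi(u)/u^2 = -\sigma^2/2$,
because $\sup_{x,\, u \ge 1} |\varphi(x, u)/u^2| < \infty$ and $\varphi(x, u)/u^2 \to 0$ for $x \ne 0$, as $u \to \infty$.
Apply the bounded convergence theorem to $\int_{\{0\}^c} (\varphi(x, u)/u^2)\,\gamma(dx)$ to get the
result. Therefore, dividing (9.28) by $u^2$ and letting $u \to \infty$ gives

$k = a_k^2$, which implies $\lambda = \tfrac12$.

So (9.28) becomes

$$k\psi(u) = ib_k u + \psi(\sqrt k\, u),$$

$$\psi(u) = i\,\frac{b_k}{k}\, u + u^2\, \frac{\psi(\sqrt k\, u)}{k u^2}.$$

As $k \to \infty$, $\psi(\sqrt k\, u)/(k u^2) \to -\sigma^2/2$. This entails $b_k/k \to \beta$ and
$\psi(u) = iu\beta - (\sigma^2/2)u^2$. Q.E.D.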


It is not difficult to check that every characteristic function of the form 
given in (9.27) is the characteristic function of a stable law. This additional 
fact completes our description of the form of the characteristic function for 
stable laws. 


Problems 


2. Use the methods of this section to show that Proposition 9.1 holds without
the restriction $EX^2 < \infty$.

3. Use Problem 2 to prove that if $X_1, X_2, \ldots$ are independent, identically
distributed random variables and

$$\frac{X_1 + \cdots + X_n}{\sqrt n} \xrightarrow{\mathcal{D}} X,$$

then X is normal or degenerate.

4. Show that if $\alpha < 1$, then $\psi(u)$ can be written as

$$\psi(u) = iu\beta + m_1 \int_0^{\infty} (e^{iux} - 1)\, \frac{dx}{x^{1+\alpha}} + m_2 \int_{-\infty}^{0} (e^{iux} - 1)\, \frac{dx}{|x|^{1+\alpha}}.$$

Then prove that $\beta = 0$, $m_2 = 0$, implies $X \ge 0$ a.s.


10. THE COMPUTATION OF THE 
STABLE CHARACTERISTIC FUNCTIONS 


When the exponent $\alpha$ is less than 2, the form of the stable characteristic
function is given by 9.27. By doing some computations, we can evaluate
these integrals in explicit form.


Theorem 9.32. $f(u) = e^{\psi(u)}$ is the characteristic function of a stable law of
exponent $\alpha$, $0 < \alpha < 1$, and $1 < \alpha < 2$ if and only if it has the form

$$\psi(u) = iuc - d|u|^{\alpha}\left(1 + i\theta\, \frac{u}{|u|} \tan\frac{\pi}{2}\alpha\right),$$

where c is real, d real and positive, and $\theta$ real such that $|\theta| \le 1$. For $\alpha = 1$, the
form of the characteristic function is given by

$$\psi(u) = iuc - d|u|\left(1 + i\theta\, \frac{u}{|u|}\, \frac{2}{\pi} \log|u|\right),$$

with c, d, $\theta$ as above.
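
The two forms in Theorem 9.32 are easy to put into code and to test against the stability relation (9.28), $k\psi(u) = ib_k u + \psi(a_k u)$. The sketch below is illustrative only: the function names are hypothetical, and the constants $a_k = k^{1/\alpha}$ with $b_k = c(k - k^{1/\alpha})$ for $\alpha \ne 1$, and $b_k = d\,k\,\theta\,(2/\pi)\log k$ for $\alpha = 1$, were worked out here directly from the displayed formulas rather than quoted from the text.

```python
import math


def stable_log_cf(u: float, alpha: float, c: float, d: float, theta: float) -> complex:
    """psi(u) in the form displayed in Theorem 9.32 (d > 0, |theta| <= 1)."""
    if u == 0.0:
        return 0j
    sign = 1.0 if u > 0 else -1.0
    if alpha == 1.0:
        skew = theta * sign * (2.0 / math.pi) * math.log(abs(u))
    else:
        skew = theta * sign * math.tan(0.5 * math.pi * alpha)
    return 1j * u * c - d * abs(u) ** alpha * (1.0 + 1j * skew)


def stability_constants(k: int, alpha: float, c: float, d: float, theta: float):
    """Constants (a_k, b_k) with k*psi(u) = i*b_k*u + psi(a_k*u); derived for this sketch."""
    a_k = k ** (1.0 / alpha)
    if alpha == 1.0:
        b_k = d * k * theta * (2.0 / math.pi) * math.log(k)
    else:
        b_k = c * (k - a_k)
    return a_k, b_k


if __name__ == "__main__":
    c, d, theta = 0.3, 1.2, -0.4
    for alpha, k in ((0.5, 3), (1.0, 5), (1.5, 4)):
        a_k, b_k = stability_constants(k, alpha, c, d, theta)
        u = 0.7
        gap = abs(k * stable_log_cf(u, alpha, c, d, theta)
                  - (1j * b_k * u + stable_log_cf(a_k * u, alpha, c, d, theta)))
        print(alpha, k, gap)
```

With $\theta = 0$ and $\alpha = 1$ the function reduces to $iuc - d|u|$, the log characteristic function of a Cauchy law.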


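Proof. Let

$$I_1(u) = \int_0^{\infty} \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) \frac{dx}{x^{1+\alpha}}.$$

Since $I_1(-u) = \overline{I_1(u)}$, we evaluate $I_1(u)$ only for u > 0. Also,

$$I_1(-u) = \int_{-\infty}^{0} \left(e^{iux} - 1 - \frac{iux}{1 + x^2}\right) \frac{dx}{|x|^{1+\alpha}}.$$

Consider first $0 < \alpha < 1$; then

$$I_1(u) = \int_0^{\infty} (e^{iux} - 1)\, \frac{dx}{x^{1+\alpha}} - iu \int_0^{\infty} \frac{1}{1 + x^2}\, \frac{dx}{x^{\alpha}}.$$

Substitute ux = y in the first term to get, for u > 0,

$$I_1(u) = iuc + u^{\alpha} H(\alpha),$$

where

$$H(\alpha) = \int_0^{\infty} (e^{iy} - 1)\, \frac{dy}{y^{1+\alpha}}.$$

For $1 < \alpha < 2$, integrate by parts, getting

$$I_1(u) = \frac{iu}{\alpha} \int_0^{\infty} \left(e^{iux} - \frac{d}{dx}\, \frac{x}{1 + x^2}\right) \frac{dx}{x^{\alpha}}
= \frac{iu}{\alpha} \int_0^{\infty} (e^{iux} - 1)\, \frac{dx}{x^{\alpha}}
+ \frac{iu}{\alpha} \int_0^{\infty} \left(1 - \frac{d}{dx}\, \frac{x}{1 + x^2}\right) \frac{dx}{x^{\alpha}}.$$

Substitute ux = y again, so

$$I_1(u) = iuc + \frac{iu^{\alpha}}{\alpha}\, H(\alpha - 1), \quad u > 0.$$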
a 


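If $\alpha = 1$, the integration by parts gives

$$I_1(u) = iu \lim_{T \to \infty} \int_0^{T} \left(e^{iux} - \frac{d}{dx}\, \frac{x}{1 + x^2}\right) \frac{dx}{x}.$$

Let

$$J(T, u) = \int_0^{T} \left(e^{iux} - \frac{d}{dx}\, \frac{x}{1 + x^2}\right) \frac{dx}{x}.$$

Then for $u_2 > u_1 > 0$,

$$J(T, u_2) - J(T, u_1) = \int_0^{T} (e^{iu_2 x} - e^{iu_1 x})\, \frac{dx}{x}
= i \int_0^{T} \left[\int_{u_1}^{u_2} e^{ivx}\, dv\right] dx = \int_{u_1}^{u_2} \frac{e^{ivT} - 1}{v}\, dv.$$

Now, by the Riemann-Lebesgue lemma (see Chapter 10, Section 2),

$$\lim_T \int_{u_1}^{u_2} \frac{e^{ivT}}{v}\, dv = \lim_T \int e^{ivT}\, \frac{\chi_{[u_1, u_2]}(v)}{v}\, dv = 0,$$

since $\chi_{[u_1, u_2]}(v)/v$ is a bounded measurable function vanishing outside a
finite interval. This gives

$$\lim_{T \to \infty} [J(T, u_2) - J(T, u_1)] = \log(u_1/u_2).$$

Consequently, checking that $\lim_{T \to \infty} J(T, 1)$ exists,

$$\lim_{T \to \infty} J(T, u) = -\log u + c,$$

and

$$I_1(u) = iuc - iu \log u, \quad u > 0.$$

For the first time, the constant c appearing in the linear term is complex:

$$c = \lim_{T \to \infty} \int_0^{T} \left(e^{ix} - \frac{d}{dx}\, \frac{x}{1 + x^2}\right) \frac{dx}{x} = c_1 + i \int_0^{\infty} \frac{\sin x}{x}\, dx,$$

where $c_1$ is real. The integral $\int_0^{\infty} (\sin x / x)\, dx$ is well known and equals $\pi/2$.
Finally, then,

$$I_1(u) = iuc_1 - (\pi/2)u - iu \log u, \quad u > 0.$$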
The remaining piece of work is to evaluate

$$H(\xi) = \int_0^{\infty} (e^{iy} - 1)\, \frac{dy}{y^{1+\xi}}, \quad 0 < \xi < 1.$$


This can be done by contour integration (see Gnedenko and Kolmogorov, 
[62, p. 169]}), with the result that 
H(é) = e0 PELE), 

where L(é) is real and negative. Putting everything together, we get 

mh (u) + mJ (—u) = mlu) + ml, (u). 
For0 < a < 1,and n > 0, 

y(u) = icu + u%(m,H(«) + mH («)) 
= icu — du*(1 + ið tan (7/2)a), 
where d = —(m, + m,)RIH(«) is real and positive, 6 = (m, — m,)/(m, + m) 
is real with range [—1, +1], and cis real. For 1 < a < 2, 
y(u) = icu + iT (m,H(a — 1) — m,H(a@ — 1)). 
a 


Now 
H(a -)= "aaa S C —l1= je HT/Day (g — 1), 
sO 
y(u) = icu — du*(1 + ið tan (m/2)a), u > 0. 




Here d = ((m, + m,)/a)Rie*"*L(x — 1)) is real and positive, and 6 is 
again (m, — m,)/(m, + m,) with range [—1, +1]. If « = 1, then 


icu — (2/2)u(m, + mz) + iu(m, — m,) log u 
icu — du(1 + i0+ (2/m) logu), u> 0, 


plu) 


6 as above, d real and positive, c real. Q.E.D. 


Problem 5. Let $S_n = X_1 + \cdots + X_n$ be consecutive sums of independent
random variables each having the same symmetric stable law of exponent
$\alpha$. Use the Fourier inversion theorem and the technique of Problem 24,
Chapter 8, to show that the sums are transient for $0 < \alpha < 1$, recurrent
for $\alpha \ge 1$. The case $\alpha = 1$ provides an example where $E|X_1| = \infty$, but
the sums are recurrent. Show that for all $\alpha > 0$,

$$P(\overline{\lim}\, S_n = \infty) = P(\underline{\lim}\, S_n = -\infty) = 1.$$

Conclude that the sums change sign infinitely often for any $\alpha > 0$.


11. THE DOMAIN OF ATTRACTION OF A STABLE LAW 


Let $X_1, X_2, \ldots$ be independent and identically distributed. What are necessary
and sufficient conditions on their distribution function F(x) such that $S_n$
suitably normalized converges in distribution to X where X is nondegenerate?
Of course, the limit random variable X must be stable.


Definition 9.33. The distribution F(x) is said to be in the domain of attraction
of a stable law with exponent $\alpha < 2$ if there are constants $A_n$, $B_n$ such that

$$\frac{S_n}{A_n} - B_n \xrightarrow{\mathcal{D}} X$$

and X has exponent $\alpha$. Denote this by $F \in D(\alpha)$.
Complete conditions on F(x) are given by


Theorem 9.34. F(x) is in the domain of attraction of a stable law with exponent
$\alpha < 2$ if and only if there are constants $M^+, M^- \ge 0$, $M^+ + M^- > 0$, such
that as $y \to \infty$:

i) $\dfrac{F(-y)}{1 - F(y)} \to \dfrac{M^-}{M^+}$,

ii) For every $\xi > 0$,

$$M^+ > 0 \implies \lim \frac{1 - F(\xi y)}{1 - F(y)} = \frac{1}{\xi^{\alpha}},$$

$$M^- > 0 \implies \lim \frac{F(-\xi y)}{F(-y)} = \frac{1}{\xi^{\alpha}}.$$
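
The two tail conditions of Theorem 9.34 can be illustrated on a concrete distribution. The example below is hypothetical and not from the text: it assumes a right tail $1 - F(y) = \tfrac12 p (1 + 1/y)\, y^{-\alpha}$ and a left tail $F(-y) = q\, y^{-\alpha}$ for $y \ge 1$, with $p + q \le 1$, and checks the limits (i) and (ii) numerically at a large y.

```python
def right_tail(y: float, alpha: float, p: float) -> float:
    """Assumed right tail 1 - F(y) = (p/2) * (1 + 1/y) * y**(-alpha), for y >= 1."""
    return 0.5 * p * (1.0 + 1.0 / y) * y ** (-alpha)


def left_tail(y: float, alpha: float, q: float) -> float:
    """Assumed left tail F(-y) = q * y**(-alpha), for y >= 1."""
    return q * y ** (-alpha)


def tail_balance(y: float, alpha: float, p: float, q: float) -> float:
    """Ratio F(-y) / (1 - F(y)) appearing in condition (i)."""
    return left_tail(y, alpha, q) / right_tail(y, alpha, p)


def right_scaling(y: float, xi: float, alpha: float, p: float) -> float:
    """Ratio (1 - F(xi*y)) / (1 - F(y)) appearing in condition (ii)."""
    return right_tail(xi * y, alpha, p) / right_tail(y, alpha, p)


if __name__ == "__main__":
    alpha, p, q = 1.5, 0.4, 0.3
    for y in (1e2, 1e4, 1e8):
        print(y, tail_balance(y, alpha, p, q), right_scaling(y, 3.0, alpha, p))
```

For this F the balance ratio tends to $2q/p$, which plays the role of $M^-/M^+$, and the scaling ratio tends to $\xi^{-\alpha}$; the slowly vanishing factor $1 + 1/y$ shows that exact power tails are not required.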
Proof. We show first necessity. By Theorem 9.22, $\gamma_n \xrightarrow{\mathcal{D}} \gamma$. Now take
$b_n = B_n/n$. Then

$$\gamma_n(B) = n \int_B \frac{y^2}{1 + y^2}\, F_n(dy),$$

where

$$F_n(x) = P\left(\frac{X_1}{A_n} - b_n < x\right) = F(A_n(x + b_n)),$$

and

$$\gamma(B) = \int_B \frac{y^2}{1 + y^2}\, \mu(dy).$$


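Thus, for any $x_1 < 0$, $x_2 > 0$, since the $\gamma_n$ must be mass-preserving,

$$\int_{(x_1, x_2)^c} \frac{1 + z^2}{z^2}\, \gamma_n(dz) \to \int_{(x_1, x_2)^c} \frac{1 + z^2}{z^2}\, \gamma(dz).$$

This is

$$n \int_{(x_1, x_2)^c} F_n(dz) \to \int_{(x_1, x_2)^c} \mu(dz).$$

By taking one or the other of $x_1$, $x_2$ to be infinite, we find that the condition becomes

$$n[1 - F_n(x)] \to M^+(x), \quad x > 0,$$
$$nF_n(x) \to M^-(x), \quad x < 0.$$

Then

$$n[1 - F(A_n(x + b_n))] \to M^+(x), \quad x > 0,$$
$$nF(A_n(x + b_n)) \to M^-(x), \quad x < 0.$$

Since $b_n \to 0$, for any $\epsilon > 0$ and n sufficiently large,

$$F(A_n(x - \epsilon + b_n)) \le F(A_n x) \le F(A_n(x + \epsilon + b_n)).$$

We can use this to get

$$M^+(x + \epsilon) \le \underline{\lim}\ n[1 - F(A_n x)] \le \overline{\lim}\ n[1 - F(A_n x)] \le M^+(x - \epsilon), \quad x > 0,$$
$$M^-(x - \epsilon) \le \underline{\lim}\ nF(A_n x) \le \overline{\lim}\ nF(A_n x) \le M^-(x + \epsilon), \quad x < 0.$$

We know that $M^+(x)$, $M^-(x)$ are continuous, so we can conclude that

$$n[1 - F(A_n x)] \to M^+(x), \quad x > 0,$$
$$nF(A_n x) \to M^-(x), \quad x < 0.$$

Now if we fix x > 0, for any y > 0 sufficiently large, there is an n such that

$$A_n x \le y \le A_{n+1} x.$$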
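Then

$$F(A_n x) \le F(y) \le F(A_{n+1} x),$$
$$F(-A_{n+1} x) \le F(-y) \le F(-A_n x).$$

So for any $\xi > 0$,

$$\frac{1 - F(A_{n+1}\xi x)}{1 - F(A_n x)} \le \frac{1 - F(\xi y)}{1 - F(y)} \le \frac{1 - F(A_n \xi x)}{1 - F(A_{n+1} x)},$$

$$\frac{M^+(\xi x)}{M^+(x)} \le \underline{\lim}\ \frac{1 - F(\xi y)}{1 - F(y)} \le \overline{\lim}\ \frac{1 - F(\xi y)}{1 - F(y)} \le \frac{M^+(\xi x)}{M^+(x)}$$

$$\implies \lim_{y \to \infty} \frac{1 - F(\xi y)}{1 - F(y)} = \frac{1}{\xi^{\alpha}}.$$

In exactly the same way, conclude

$$\lim_{y \to \infty} \frac{F(-\xi y)}{F(-y)} = \frac{1}{\xi^{\alpha}}.$$

Also,

$$\frac{F(-A_{n+1} x)}{1 - F(A_n x)} \le \frac{F(-y)}{1 - F(y)} \le \frac{F(-A_n x)}{1 - F(A_{n+1} x)},$$

which leads to

$$\lim_{y \to \infty} \frac{F(-y)}{1 - F(y)} = \frac{M^-(-x)}{M^+(x)} = \frac{M^-(-1)}{M^+(1)}.$$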

To get sufficiency, assume, for example, that $M^+ > 0$. I assert we can
define constants $A_n > 0$ such that

$$n(1 - F(A_n)) \to M^+.$$

Condition (ii) implies $1 - F(x) > 0$, all x > 0. Take $A_n$ such that for
any $\epsilon > 0$, $n(1 - F(A_n)) \ge M^+$, but $n(1 - F(A_n + \epsilon)) \le M^+$. Then if
$\overline{\lim}\ n(1 - F(A_n)) = M^+(1 + \delta)$, $\delta > 0$, there is a subsequence n' such that
for every $\epsilon > 0$,


This is ruled out by condition (ii). So

$$n(1 - F(A_n x)) = n(1 - F(A_n)) \cdot \frac{1 - F(A_n x)}{1 - F(A_n)} \to M^+ \cdot \frac{1}{x^{\alpha}}.$$

Similarly,

$$nF(-A_n x) \to M^- \cdot \frac{1}{x^{\alpha}}.$$
Take F to be the distribution function of a random variable $X_1$, and write

$$\mu_n(B) = nP\left(\frac{X_1}{A_n} \in B\right), \quad B \in \mathcal{B}_1.$$

Then

$$\mu_n([x, \infty)) \to \frac{M^+}{x^{\alpha}}, \quad x > 0,$$
$$\mu_n((-\infty, x]) \to \frac{M^-}{|x|^{\alpha}}, \quad x < 0.$$

Therefore, the $\{\mu_n\}$ sequence is mass-preserving, and so is the $\{\gamma_n\}$ sequence
defined, as before, by

$$\gamma_n(B) = \int_B \frac{y^2}{1 + y^2}\, \mu_n(dy), \quad B \in \mathcal{B}_1.$$
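Take f(x) any bounded continuous function, and $\epsilon > 0$:

$$\int f(x)\,\gamma_n(dx) = \int_{|x| > \epsilon} f(x)\,\gamma_n(dx) + \int_{|x| \le \epsilon} f(x)\,\gamma_n(dx).$$

Let $g(x) = f(x)x^2/(1 + x^2)$. The first integral converges to

$$\alpha M^+ \int_{\epsilon}^{\infty} g(x)\, \frac{dx}{x^{1+\alpha}} + \alpha M^- \int_{-\infty}^{-\epsilon} g(x)\, \frac{dx}{|x|^{1+\alpha}}
= \alpha M^+ \int_0^{\infty} g(x)\, \frac{dx}{x^{1+\alpha}} + \alpha M^- \int_{-\infty}^{0} g(x)\, \frac{dx}{|x|^{1+\alpha}} + \delta_1(\epsilon),$$

where $\delta_1(\epsilon) \to 0$ as $\epsilon \to 0$. Thus, defining $\gamma(x)$ in the obvious way,

$$\overline{\lim} \left|\int f(x)\,\gamma_n(dx) - \int f(x)\,\gamma(dx)\right| \le \delta_1(\epsilon) + \overline{\lim} \int_{|x| \le \epsilon} |f(x)|\,\gamma_n(dx).$$

To complete the proof that $\gamma_n \xrightarrow{\mathcal{D}} \gamma$, we need


Proposition 9.35

$$\overline{\lim_n} \int_{|x| \le \epsilon} x^2 \mu_n(dx) = \delta_2(\epsilon),$$

where $\delta_2(\epsilon) \to 0$ as $\epsilon \to 0$.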


In order not to interrupt the main flow of the proof, I defer the proof of
9.35 until the end. Since $A_n \to \infty$, the characteristic function $g_n(u)$ of
$(X_1 + \cdots + X_n)/A_n$ is given, as before, by

$$\log g_n(u) = (1 + \epsilon_n(u)) \int \varphi(x, u)\,\gamma_n(dx) + (1 + \epsilon_n(u))\, iu\beta_n,$$

where

$$\beta_n = \int \frac{x}{1 + x^2}\, \mu_n(dx)$$

and $\epsilon_n(u) \to 0$. Since $\gamma_n \xrightarrow{\mathcal{D}} \gamma$, the first term tends toward $\int \varphi(x, u)\,\gamma(dx)$.
The characteristic function of $S_n/A_n - \beta_n$ is $e^{-iu\beta_n} g_n(u)$. So, if $\epsilon_n(u)\beta_n \to 0$,
then the theorem (except for 9.35) follows with $B_n = \beta_n$. For n sufficiently
large,

$$|\epsilon_n(u)| \le |1 - f(u/A_n)|,$$

where f is the characteristic function of $X_1$. But

$$n\left(f\left(\frac{u}{A_n}\right) - 1\right) = \int (e^{iux} - 1)\,\mu_n(dx) = \int \varphi(x, u)\,\gamma_n(dx) + iu\beta_n,$$

so it is sufficient to show that $\beta_n^2/n \to 0$. For any $\epsilon > 0$, use the Schwarz
inequality to get
inequality to get 


im? < mif plado] 
nLJiri<e 


n 


< lim [; p(s, +8) f a ud) | 


< iim f x"u,(dx), 
lzl<e 


Apply 9.35 to reach the conclusion. 
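Proof of 9.35. We adapt a proof due to Feller [59, Vol. II]. Write

$$I(x) = \int_0^x y(1 - F(y))\,dy.$$

We begin by showing that there is a constant $c$ such that

$$I(x) \le c x^2 (1 - F(x)).$$

For $t > \tau$,

$$I(tx) = I(\tau x) + x^2 \int_\tau^t \xi (1 - F(x\xi))\,d\xi. \tag{9.36}$$

Fix $x > 1$. For any $\varepsilon > 0$, take $\tau$ so large that for $\xi > \tau$,

$$\frac{1 - F(x\xi)}{1 - F(\xi)} \ge (1 - \varepsilon)x^{-\alpha}.$$

From (9.36),

$$I(tx) \ge I(\tau x) + x^{2-\alpha}(1 - \varepsilon)\big(I(t) - I(\tau)\big).$$

This inequality implies $I(t) \to \infty$ as $t \to \infty$. Then, dividing by $I(t)$ and
letting $t \to \infty$ yields

$$\liminf_{t \to \infty} \frac{I(tx)}{I(t)} \ge x^{2-\alpha}.$$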
Now in (9.36) take $\tau = 1$, so

$$\frac{I(tx)}{I(x)} = 1 + \frac{x^2(1 - F(x))}{I(x)} \int_1^t \xi\,\frac{1 - F(x\xi)}{1 - F(x)}\,d\xi.$$

The integrand is bounded by $\xi$, so

$$\frac{I(tx)}{I(x)} \le 1 + \frac{x^2(1 - F(x))}{I(x)}\,t^2.$$

Thus, taking lim inf on $x$,

$$t^{2-\alpha} \le 1 + t^2 \liminf_{x \to \infty} \frac{x^2(1 - F(x))}{I(x)}.$$

This does it.
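To relate this to 9.35, integrate by parts on $\int_{[0,\varepsilon)} x^2 \mu_n(dx)$, assuming $\varepsilon$
is a continuity point of $\mu_n$ for all $n$. This gives

$$\int_{[0,\varepsilon)} x^2 \mu_n(dx) \le 2\int_{[0,\varepsilon)} x\,\mu_n([x, \infty))\,dx.$$

From $\mu_n([x, \infty)) = n(1 - F(A_n x))$, we get

$$\int_{[0,\varepsilon)} x\,\mu_n([x, \infty))\,dx \le \frac{n}{A_n^2} \int_0^{A_n \varepsilon} y(1 - F(y))\,dy = \frac{n}{A_n^2}\,I(A_n \varepsilon).$$

Because of the inequality proved above,

$$\int_{[0,\varepsilon)} x^2 \mu_n(dx) \le 2nc\varepsilon^2 (1 - F(A_n \varepsilon)).$$

Therefore,

$$\limsup_n \int_{[0,\varepsilon)} x^2 \mu_n(dx) \le 2cM^+ \varepsilon^{2-\alpha}.$$

The integral over the range $(-\varepsilon, 0)$ is treated in the same way, with $F(-x)$
in place of $1 - F(x)$. Q.E.D.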
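
As a quick sanity check on the inequality $I(x) \le cx^2(1 - F(x))$, here is a worked example for one hypothetical tail. The choice $1 - F(y) = y^{-\alpha}$ for $y \ge 1$ (and $1 - F(y) = 1$ for $0 \le y < 1$), with $0 < \alpha < 2$, is an illustrative assumption and is not taken from Feller's argument.

```latex
\begin{align*}
I(x) &= \int_0^1 y\,dy + \int_1^x y^{1-\alpha}\,dy
      = \frac{1}{2} + \frac{x^{2-\alpha} - 1}{2 - \alpha}, \qquad x \ge 1, \\
\frac{I(x)}{x^2(1 - F(x))} &= \frac{I(x)}{x^{2-\alpha}}
      = \frac{1}{2 - \alpha} - \frac{\alpha}{2(2 - \alpha)}\,x^{\alpha - 2}
      \;\le\; \frac{1}{2 - \alpha}.
\end{align*}
```

So for this tail the constant $c = 1/(2 - \alpha)$ works for every $x \ge 1$, and the ratio tends to $1/(2 - \alpha)$ as $x \to \infty$.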


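Problems
6. Show that $n/A_n^2 \to 0$. [For any $\delta > 0$, take $x_0$ such that for $x > x_0$

$$\frac{1 - F(2x)}{1 - F(x)} \ge 2^{-(\alpha + \delta)}.$$

Define $k$ by $2^{k-1} < A_n \le 2^k$; then

$$1 - F(A_n) \ge 1 - F(2^k) = \frac{1 - F(2^k)}{1 - F(2^{k-1})} \cdots \frac{1 - F(2^{m+1})}{1 - F(2^m)}\,\big(1 - F(2^m)\big).$$

Now select $m$ appropriately.]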
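

7. Show, using the same methods used to get $n/A_n^2 \to 0$, that for $F(x) \in D(\alpha)$,

$$\lim_{x \to \infty} x^\lambda \big(1 - F(x) + F(-x)\big) = \begin{cases} \infty, & \lambda > \alpha, \\ 0, & \lambda < \alpha. \end{cases}$$

Conclude that

$$\int |x|^\lambda F(dx) \begin{cases} < \infty, & \lambda < \alpha, \\ = \infty, & \lambda > \alpha. \end{cases}$$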


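8. For $F(x) \in D(\alpha)$, $\alpha < 1$, by using the same methods as in the proof of
9.35, show that

$$\limsup_n \int_{|x| \le \varepsilon} |x|\,\mu_n(dx) = \delta(\varepsilon), \qquad \delta(\varepsilon) \to 0 \text{ as } \varepsilon \to 0.$$

Conclude that $\beta_n$ converges to a finite constant $\beta$; hence $B_n$ can be defined
to be zero, all $n$.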


9. For $F(x) \in D(\alpha)$, $1 < \alpha < 2$, show that

$$\int \frac{x}{1 + x^2}\,\mu_n(dx) - \int x\,\mu_n(dx)$$

converges, so $B_n/n$ can be taken as $-EX_1/A_n$; that is, $X_1, X_2, \ldots$ are
replaced by $X_1 - EX_1, X_2 - EX_1, \ldots$
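10. For $F(x) \in D(\alpha)$, $\alpha < 1$, if $F(x) = 0$, $x < 0$, then

$$\log f_{S_n/A_n}(u) = \int_0^\infty (e^{iux} - 1)\,\mu_n(dx)\,(1 + \varepsilon_n(u)).$$

Prove, using Problem 8 and the computations of the various integrals
defining the stable laws, that

$$\frac{S_n}{A_n} \xrightarrow{\mathcal{D}} X, \quad \text{where } \log f_X(u) = -M^+ \Gamma(1 - \alpha)\,e^{-i(\pi/2)\alpha} u^\alpha, \quad u > 0.$$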


12. A COIN-TOSSING EXAMPLE 
The way to recognize laws in $D(\alpha)$ is given by the following.

Definition 9.38. A function $H(x)$ on $(0, \infty)$ is said to be slowly changing if for
all $\xi \in (0, \infty)$,

$$\lim_{x \to \infty} \frac{H(\xi x)}{H(x)} = 1.$$

For example, $\log x$ is slowly changing, so is $\log \log x$, but not $x^\alpha$, $\alpha \ne 0$.
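
The definition can be checked numerically. The sketch below is only an illustration: the helper name `ratio`, the choice $\xi = 10$, and the sample points are assumptions made here, not part of the text. It evaluates $H(\xi x)/H(x)$ for $H = \log$ and for $H(x) = x^{1/2}$.

```python
import math

def ratio(H, xi, x):
    """Return H(xi * x) / H(x), the quantity appearing in Definition 9.38."""
    return H(xi * x) / H(x)

xi = 10.0
for x in (1e3, 1e6, 1e12, 1e24):
    r_log = ratio(math.log, xi, x)               # drifts toward 1 as x grows
    r_pow = ratio(lambda t: t ** 0.5, xi, x)     # stays at sqrt(xi) for every x
    print(f"x={x:.0e}  log ratio={r_log:.4f}  sqrt ratio={r_pow:.4f}")
```

The logarithm's ratio approaches 1 only slowly (it equals $1 + \log \xi/\log x$), while the power function's ratio is the constant $\xi^{1/2}$, so it is not slowly changing.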
Proposition 9.39. $F(x) \in D(\alpha)$ if and only if, defining $H^+(x)$, $H^-(x)$ on $(0, \infty)$
by

$$1 - F(x) = \frac{1}{x^\alpha} H^+(x), \qquad F(-x) = \frac{1}{x^\alpha} H^-(x),$$

there are constants $M^+ \ge 0$, $M^- \ge 0$, $M^+ + M^- > 0$ such that
$M^+ > 0 \Rightarrow H^+(x)$ slowly changing, $M^- > 0 \Rightarrow H^-(x)$ slowly changing,
and, as $x \to \infty$,

$$\frac{H^+(x)}{H^-(x)} \to \frac{M^+}{M^-}.$$

Proof. Obvious.


Now, in fair coin-tossing, if $T_1$ is the time until the first return to equilib-
rium, then by Problem 19, Chapter 3,

$$P(T_1 > n) = \frac{\sqrt{2}}{\sqrt{\pi n}}\,(1 + \delta_n),$$

where $\delta_n \to 0$, so $1 - F(n) \sim cn^{-1/2}$. Thus for the range $n \le x \le n + 1$,
$1 - F(x) \sim cn^{-1/2}$, or finally, $1 - F(x) = cx^{-1/2}(1 + \delta(x))$, $\delta(x) \to 0$. There-
fore, by 9.39, $F(x) \in D(\tfrac12)$. To get the normalizing $A_n$, the condition is that
$n(1 - F(A_n))$ converges to $M^+$, that is,

$$cnA_n^{-1/2}(1 + \delta(A_n)) \to M^+.$$

Take $A_n = n^2$; then $M^+ = c = \sqrt{2}/\sqrt{\pi}$. We conclude, if $R_n$ is the time of
the $n$th return to equilibrium, then


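Theorem 9.40. $R_n/n^2 \xrightarrow{\mathcal{D}} X$, where

$$\log f_X(u) = -\frac{\sqrt{2}}{\sqrt{\pi}}\,\Gamma(1 - \alpha)\,e^{-i(\pi/2)\alpha} u^\alpha, \qquad \alpha = \tfrac12, \quad u \ge 0.$$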
Proof. See Problem 10. 
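
The tail estimate behind this example can be checked by exact computation. The following sketch is an illustration under stated assumptions: the function name `first_return_tail` and the dynamic-programming layout are hypothetical choices made here. It computes $P(T_1 > n)$ for the fair $\pm 1$ walk by following the mass of paths that have not yet returned to zero, then compares with $\sqrt{2}/\sqrt{\pi n}$.

```python
import math

def first_return_tail(n_max):
    """Return [P(T_1 > n) for n = 0..n_max] for the fair +-1 walk started at 0."""
    tails = [1.0]
    # alive[k] = P(S_1 > 0, ..., S_n > 0, S_n = k) at the current time n;
    # the negative side carries the same mass by symmetry.
    alive = {1: 0.5}
    for n in range(1, n_max + 1):
        tails.append(2.0 * sum(alive.values()))
        nxt = {}
        for k, p in alive.items():
            nxt[k + 1] = nxt.get(k + 1, 0.0) + 0.5 * p
            if k > 1:  # a step from 1 down to 0 is a return, so that mass is dropped
                nxt[k - 1] = nxt.get(k - 1, 0.0) + 0.5 * p
        alive = nxt
    return tails

tails = first_return_tail(400)
for n in (10, 100, 400):
    print(n, tails[n] * math.sqrt(math.pi * n / 2.0))
```

The printed ratios approach 1 along even $n$, in line with $1 - F(n) \sim cn^{-1/2}$ and $c = \sqrt{2}/\sqrt{\pi}$.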


13. THE DOMAIN OF ATTRACTION OF THE NORMAL LAW 


If we look at the various possible distribution functions $F(x)$ and ask when
we can normalize $S_n$ so that $S_n/A_n - B_n$ converges to a nondegenerate limit,
we now have some pretty good answers. For $F(x)$ to be in $D(\alpha)$, the
tails of the distribution must behave in a very smooth way; actually, the
mass of the distribution out toward $\infty$ must mimic the behavior of the tails
of the limiting stable law. So for $0 < \alpha < 2$, only a few distribution
functions $F(x)$ are in $D(\alpha)$. But the normal law is the limit for a wide class of
distributions, including all $F(x)$ such that $\int x^2 F(dx) < \infty$. The obvious
unanswered question is: What else does this class contain? In other words,
for what distributions $F(x)$ does $S_n/A_n - B_n \xrightarrow{\mathcal{D}} \mathcal{N}(0, 1)$ for an appropriate
choice of $A_n$, $B_n$?
We state the result only; see Gnedenko and Kolmogorov [62, pp.
172 ff.] for the proof.


Theorem 9.41. There exist $A_n$, $B_n$ such that $S_n/A_n - B_n \xrightarrow{\mathcal{D}} \mathcal{N}(0, 1)$ if and
only if

$$\lim_{x \to \infty} \frac{x^2 \int_{|y| > x} F(dy)}{\int_{|y| \le x} y^2 F(dy)} = 0. \tag{9.42}$$


Problem 11. Show that $\int y^2 F(dy) < \infty$ implies that the limit in (9.42) is zero.
Find a distribution such that $\int y^2 F(dy) = \infty$, but (9.42) is satisfied.


NOTES 


The central limit theorem 9.2 dates back to Lyapunov [109, 1901]. The
general setting of the problem into the context of infinitely divisible laws
starts with Kolmogorov [94, 1932], who found all infinitely divisible laws
with finite second moment, and Paul Lévy [102, 1934], who derived the
general expression while investigating processes depending on a continuous-
time parameter (see Chapter 14). The present framework dates to Feller
[51] and Khintchine [89] in 1937. Stable distributions go back to Paul
Lévy [100, 1924], and also [103, 1937]. In 1939 and 1940 Doeblin, [27] and
[29], analyzed the problem of domains of attraction. One fascinating
discovery in his later paper was the existence of universal laws. These are
laws $\mathcal{L}(X)$ such that for $S_n = X_1 + \cdots + X_n$ sums of independent random
variables each having the law $\mathcal{L}(X)$, there are normalizing constants $A_n$,
$B_n$ such that for $Y$ having any infinitely divisible distribution, there is a
subsequence $\{n_k\}$ with

$$\frac{S_{n_k}}{A_{n_k}} - B_{n_k} \xrightarrow{\mathcal{D}} Y.$$

For more discussion of stable laws see Feller's book [59, Vol. II]. A much
deeper investigation into the area of this chapter is given by Gnedenko and
Kolmogorov [62].


CHAPTER 10 


THE RENEWAL THEOREM 
AND LOCAL LIMIT THEOREM 


1. INTRODUCTION 


By sharpening our analytic tools, we can prove two more important weak 
limit theorems regarding the distribution of sums of independent, identically 
distributed random variables. We group these together because the methods 
are very similar, involving a more delicate use of characteristic functions and 
Fourier analytical methods. In the last sections, we apply the local limit 
theorem to get occupation time laws. This we do partly because of their own 
interest, and also because they illustrate the use of Tauberian arguments and 
the method of moments. 


2. THE TOOLS 
A basic necessity is the 


Riemann-Lebesgue lemma. Let $f(x)$ be $\mathcal{B}_1$-measurable and $\int |f(x)|\,dx < \infty$;
then

$$\lim_{u \to \pm\infty} \int e^{iux} f(x)\,dx = 0. \tag{10.1}$$


Proof. For any $\varepsilon > 0$, take $I$ a finite interval such that

$$\int_{I^c} |f(x)|\,dx < \varepsilon,$$

and take $M$ such that

$$\int_{\{|f(x)| > M\}} |f(x)|\,dx < \varepsilon.$$

Therefore,

$$\left| \int e^{iux} f(x)\,dx - \int_{I \cap \{|f(x)| \le M\}} e^{iux} f(x)\,dx \right| < 2\varepsilon,$$

so it is sufficient to prove this lemma for bounded $f(x)$ vanishing off of
finite intervals $I$. Then (see Problem 5, Chapter 5) for any $\varepsilon > 0$, there are
disjoint intervals $I_1, \ldots, I_n$ such that

$$\int_I \left| f(x) - \sum_{k=1}^{n} \alpha_k \chi_{I_k}(x) \right| dx < \varepsilon.$$

By direct computation,

$$\int e^{iux} \chi_{I_k}(x)\,dx \to 0$$

as $u$ goes to $\pm\infty$.
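
The "direct computation" in the last step is the closed form $\int_a^b e^{iux}\,dx = (e^{iub} - e^{iua})/(iu)$, whose modulus is at most $2/|u|$. A minimal sketch follows; the function name `indicator_transform` and the interval $[0, 1]$ are illustrative assumptions.

```python
import cmath

def indicator_transform(u, a, b):
    """Integral of exp(i*u*x) over [a, b], in closed form (requires u != 0)."""
    return (cmath.exp(1j * u * b) - cmath.exp(1j * u * a)) / (1j * u)

for u in (1.0, 10.0, 100.0, 1000.0):
    val = indicator_transform(u, 0.0, 1.0)
    # The modulus never exceeds 2/|u|, so it tends to zero as |u| grows.
    print(u, abs(val), 2.0 / u)
```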

Next, we need to broaden the concept of convergence in distribution. 


Consider the class $Q$ of $\sigma$-additive measures $\mu$ on $\mathcal{B}_1$ such that $\mu(I)$ is finite
for every finite interval $I$.


Definition 10.2. Let $\mu_n, \mu \in Q$. Say that $\mu_n$ converges weakly to $\mu$, $\mu_n \xrightarrow{w} \mu$,
if for every continuous function $\varphi(x)$ vanishing off of a finite interval

$$\int \varphi(x)\mu_n(dx) \to \int \varphi(x)\mu(dx). \tag{10.3}$$


If the $\mu_n$, $\mu$ have total mass one, then weak convergence coincides with
convergence in distribution. Some of the basic results concerning conver-
gence in distribution have easy extensions.


Proposition 10.4. $\mu_n \xrightarrow{w} \mu$ iff for every Borel set $A$ contained in a finite
interval such that $\mu(\mathrm{bd}(A)) = 0$,

$$\mu_n(A) \to \mu(A).$$

The proof is exactly the same as in Chapter 8. The Helly-Bray theorem
extends to

Theorem 10.5. If $\mu_n \in Q$ and for every finite interval $I$, $\limsup_n \mu_n(I) < \infty$, then
there is a subsequence $\mu_{n_k}$ converging weakly to $\mu \in Q$.

Proof. For $I_k = [-k, +k]$, $k = 1, 2, \ldots$, use the Helly-Bray theorem to
get an ordered subset $N_k$ of the positive integers such that on the interval
$I_k$, $\mu_n \xrightarrow{w} \mu^{(k)}$ as $n$ runs through $N_k$, and $N_{k+1} \subset N_k$. Here we use the obvious
fact that the Helly-Bray theorem holds for measures whose total mass is bounded
by the same constant. Let $\mu$ be the measure that agrees with $\mu^{(k)}$ on Borel
subsets of $I_k$. Since $N_{k+1} \subset N_k$, $\mu$ is well defined. Let $n_k$ be the $k$th member of
$N_k$. Then clearly, $\mu_{n_k} \xrightarrow{w} \mu$.

There are also obvious generalizations of the ideas of separating classes 
of functions. But the key result we need is this: 
Definition 10.6. Let $\mathcal{H}$ be the class of $\mathcal{B}_1$-measurable complex-valued functions
$h(x)$ such that

$$\int |h(x)|\,dx < \infty$$

and

$$h(x) = \int e^{iux}\,\hat{h}(u)\,du,$$

where $\hat{h}(u)$ is real and vanishes outside of a finite interval $I$.


Note that if $h \in \mathcal{H}$, then for any real $v$, the function $e^{ivx}h(x)$ is again in $\mathcal{H}$.
Theorem 10.7. Let $\mu_n, \mu \in Q$. Suppose that there is an everywhere positive
function $h_0 \in \mathcal{H}$ such that $\int h_0\,d\mu$ is finite, and

$$\int h\,d\mu_n \to \int h\,d\mu$$

for all functions $h \in \mathcal{H}$ of the form $e^{ivx}h_0(x)$, $v$ real. Then $\mu_n \xrightarrow{w} \mu$.


Proof. Let $\alpha_n = \int h_0\,d\mu_n$, $\alpha = \int h_0\,d\mu$. Note that $h_0$, or for that matter
any $h \in \mathcal{H}$, is continuous. If $\alpha = 0$, then $\mu = 0$, and since $\alpha_n \to \alpha$, for any
finite interval $I$, $\mu_n(I) \to 0$. Hence, assume $\alpha > 0$. Define probability
measures $\nu_n$, $\nu$ on $\mathcal{B}_1$ by

$$\nu_n(B) = \frac{1}{\alpha_n} \int_B h_0\,d\mu_n, \qquad \nu(B) = \frac{1}{\alpha} \int_B h_0\,d\mu.$$

By the hypothesis of the theorem, for all real $v$,

$$\int e^{ivx}\nu_n(dx) \to \int e^{ivx}\nu(dx),$$

so that $\nu_n \xrightarrow{\mathcal{D}} \nu$. Thus, by 8.12, for any bounded continuous $g(x)$,

$$\int g(x)h_0(x)\mu_n(dx) \to \int g(x)h_0(x)\mu(dx).$$

For any continuous $\varphi(x)$ vanishing off of a finite interval $I$, take $g(x) =
\varphi(x)/h_0(x)$ to conclude

$$\int \varphi(x)\mu_n(dx) \to \int \varphi(x)\mu(dx).$$


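A question remains as to whether $\mathcal{H}$ contains any everywhere positive
functions. To see that it does, check that

$$h_\lambda(x) = \left(\frac{\sin \lambda x}{\lambda x}\right)^2, \qquad \lambda > 0,$$

is in $\mathcal{H}$, with

$$\hat{h}_\lambda(u) = \begin{cases} 2^{-1}\lambda^{-1} - 4^{-1}\lambda^{-2}|u|, & |u| \le 2\lambda, \\ 0, & |u| > 2\lambda. \end{cases}$$

Now, for $\lambda_1$, $\lambda_2$ not rational multiples of each other, the function $h(x) =
h_{\lambda_1}(x) + h_{\lambda_2}(x)$ is everywhere positive and in $\mathcal{H}$.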
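
The two claims in this construction can be probed numerically. The sketch below rests on an assumption stated plainly: it takes $h_\lambda(x) = (\sin \lambda x/\lambda x)^2$ with the triangular transform $\hat{h}_\lambda(u) = (1 - |u|/2\lambda)/(2\lambda)$ on $|u| \le 2\lambda$, which is one reading of the formulas above. The function names and the midpoint rule are illustrative choices.

```python
import math

def h(lam, x):
    """(sin(lam*x) / (lam*x))**2, with the limiting value 1 at x = 0."""
    if x == 0.0:
        return 1.0
    return (math.sin(lam * x) / (lam * x)) ** 2

def h_hat(lam, u):
    """Assumed transform: a triangle supported on |u| <= 2*lam."""
    a = abs(u)
    return (1.0 - a / (2.0 * lam)) / (2.0 * lam) if a <= 2.0 * lam else 0.0

def inverse_transform(lam, x, steps=20000):
    """Midpoint-rule value of the integral of exp(i*u*x) * h_hat(u) du.

    h_hat is even, so the integral is real and equals twice the cosine
    integral over [0, 2*lam].
    """
    width = 2.0 * lam / steps
    total = 0.0
    for j in range(steps):
        u = (j + 0.5) * width
        total += math.cos(u * x) * h_hat(lam, u)
    return 2.0 * total * width

# The representation h(x) = integral of exp(iux) h_hat(u) du, checked at one point.
print(h(1.0, 2.5), inverse_transform(1.0, 2.5))

# h_1 vanishes at x = pi, but h_sqrt(2) does not, so the sum stays positive there.
print(h(1.0, math.pi), h(math.sqrt(2.0), math.pi))
```

Each $h_\lambda$ vanishes exactly at the nonzero multiples of $\pi/\lambda$; when $\lambda_1/\lambda_2$ is irrational the two zero sets are disjoint, which is why the sum is everywhere positive.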


3. THE RENEWAL THEOREM 


For independent, identically distributed random variables $X_1, X_2, \ldots$,
define, as usual, $S_n = X_1 + \cdots + X_n$, $S_0 = 0$. In the case that the sums
are transient, so that $P(S_n \in I \text{ i.o.}) = 0$, all finite $I$, there is one major result
concerning the interval distribution of the sums. It interests us chiefly in the
case of finite nonzero first moment, say $0 < EX_1 < \infty$. The law of large
numbers guarantees $S_n \to \infty$ a.s. But there is still the possibility of a more or
less regular progression of the sums out to $\infty$. Think of $X_1, X_2, \ldots$ as the
successive life spans of light bulbs in a given socket. After a considerable
number of years of operation, the distributions of this replacement process
should be invariant under shifts of time axis. For instance, the expected
number of failures in any interval of time $I$ should depend only on the length
of $I$. This is essentially the renewal theorem: Let $l_d$ assign mass $d$ to every
point of $L_d$; $l_0$ is Lebesgue measure on $L_0$.


Theorem 10.8. Suppose $X_1$ is distributed on the lattice $L_d$, $d \ge 0$. Let $N(I)$ be
the number of members of the sequence $S_0, S_1, S_2, \ldots$ landing in the finite
interval $I$. Then as $y \to \infty$ through $L_d$,

$$\lim_y EN(I + y) = \frac{l_d(I)}{EX_1}.$$


Remarks. The puzzle is why the particular limit of 10.8 occurs. To get some
feel for this, look at the nonlattice case. Let $N(B)$ denote the number of
landings of the sequence $S_0, S_1, \ldots$ in any Borel set $B$. Suppose that
$\lim_y EN(B + y)$ existed for all $B \in \mathcal{B}_1$. Denote this limit by $Q(B)$. Note that
$B_1$, $B_2$ disjoint implies $N(B_1 \cup B_2) = N(B_1) + N(B_2)$. Hence $Q(B)$ is a
nonnegative, additive set function. With a little more work, its $\sigma$-additivity
and $\sigma$-finiteness can be established. The important fact now is that for any
$x \in R^{(1)}$, $B \in \mathcal{B}_1$, $Q(B + x) = Q(B)$; hence $Q$ is invariant under translations.
Therefore $Q$ must be some constant multiple of Lebesgue measure, $Q(dx) =
\alpha\,dx$. To get $\alpha$, by adding up disjoint intervals, see that

$$EN([0, x]) \sim \alpha x.$$

Let $N(x) = N([0, x])$, and assume in addition that $X_1 \ge 0$ a.s.; then

$$N(x) = \max\{n;\ S_n \le x\}.$$

By the law of large numbers,

$$\frac{S_n}{n} \to EX_1. \tag{10.9}$$

Along every sample sequence such that (10.9) holds, $N(x) \to \infty$, and

$$\lim_{x \to \infty} \frac{S_{N(x)}}{N(x)} = EX_1, \qquad \lim_{x \to \infty} \frac{S_{N(x)+1}}{N(x) + 1} = EX_1.$$

Since

$$\frac{S_{N(x)}}{N(x)} \le \frac{x}{N(x)} \le \frac{S_{N(x)+1}}{N(x)},$$

then for the sequences such that (10.9) holds,

$$\frac{N(x)}{x} \to \frac{1}{EX_1}.$$

If we could take expectations of this then $\alpha = 1/EX_1$. (This argument is
due to Doob [35].)
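
For a concrete feel for the limit in the lattice case $d = 1$, take positive integer-valued steps. Then $u_n = EN(\{n\})$ satisfies the recursion $u_0 = 1$, $u_n = \sum_k P(X_1 = k)\,u_{n-k}$, and the theorem with $I = \{0\}$ predicts $u_n \to 1/EX_1$. The sketch below is an illustration only: the step distribution and the name `renewal_sequence` are assumptions made here.

```python
def renewal_sequence(pmf, n_max):
    """u[n] = expected number of sums S_0, S_1, ... equal to n.

    pmf maps each positive integer step size k to P(X_1 = k).
    """
    u = [0.0] * (n_max + 1)
    u[0] = 1.0  # S_0 = 0 always lands on the point 0
    for n in range(1, n_max + 1):
        # Condition on the first step: after a step of size k the process restarts.
        u[n] = sum(p * u[n - k] for k, p in pmf.items() if k <= n)
    return u

pmf = {1: 0.2, 2: 0.5, 3: 0.3}              # hypothetical life-span distribution
mean = sum(k * p for k, p in pmf.items())   # EX_1 = 2.1
u = renewal_sequence(pmf, 200)
for n in (1, 5, 20, 200):
    print(n, u[n], 1.0 / mean)
```

Because the step sizes 1, 2, 3 have greatest common divisor 1, the minimal lattice is $L_1$ and the sequence settles at $1/EX_1$ rather than oscillating.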
Proof of 10.8. Define the renewal measure on $\mathcal{B}_1$ by

$$H(B) = EN(B), \qquad B \in \mathcal{B}_1.$$

The theorem states that $H(\cdot + y) \xrightarrow{w} l_d(\cdot)/EX_1$ as $y$ goes to plus infinity
through $L_d$. For technical reasons, it is easier to define

$$\mu_y(B) = H(B + y) + H(-B - y), \qquad B \in \mathcal{B}_1.$$

The second measure converges weakly to zero as $y$ goes to plus infinity
through $L_d$. Because (from Chapter 3, Section 7) if $I$ is any interval of
length $a$, then

$$EN(I) \le P(S_n \in I \text{ at least once})\,EN([-a, +a]).$$

So for $I = [b - a, b]$,

$$EN(I - y) \le P\big(\inf_n S_n \le b - y\big)\,EN([-a, +a]).$$

The fact that $EX_1 > 0$ implies that the sums are transient, and $EN(I) < \infty$
for all finite intervals $I$. Since $S_n \to \infty$ a.s., $\inf_n S_n$ is a.s. finite. Hence the
right-hand side above goes to zero as $y \to +\infty$. So it is enough to show that
$\mu_y \xrightarrow{w} l_d/EX_1$. Let $h \in \mathcal{H}$. The first order of business is to evaluate
$\int h(x)\mu_y(dx)$. Since

$$N(B) = \sum_{n=0}^{\infty} \chi_B(S_n), \qquad B \in \mathcal{B}_1,$$

then

$$H(B) = \sum_{n=0}^{\infty} P(S_n \in B).$$
Using

$$h(x) = \int e^{iux}\,\hat{h}(u)\,du$$

gives

$$\int h(x)H(dx + y) = \sum_{n=0}^{\infty} \iint \hat{h}(u)e^{iux}P(S_n \in dx + y)\,du,$$

where we assume either that $h(x)$ is nonnegative, or $\int |h(x)|\,\mu_y(dx) < \infty$.
Note that

$$\int e^{iux}P(S_n \in dx + y) = e^{-iuy} \int e^{iux}P(S_n \in dx) = e^{-iuy}[f(u)]^n,$$

so

$$\int h(x)H(dx + y) = \sum_{n=0}^{\infty} \int e^{-iuy}\,\hat{h}(u)[f(u)]^n\,du.$$
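We would like to take the sum inside the integral sign, but the divergence
at $u = 0$ of $\sum_0^\infty [f(u)]^n$ is troublesome. Alternatively, let

$$H_r(B) = \sum_{n=0}^{\infty} r^n P(S_n \in B), \qquad 0 \le r < 1.$$

Compute as above, that

$$\int h(x)H_r(dx + y) = \sum_{n=0}^{\infty} r^n \int e^{-iuy}\,\hat{h}(u)[f(u)]^n\,du.$$

Now there is no trouble in interchanging to get

$$\int h(x)H_r(dx + y) = \int e^{-iuy}\,\hat{h}(u)\,\frac{1}{1 - rf(u)}\,du.$$

In the same way,

$$\int h(x)H_r(-dx - y) = \int e^{-iuy}\,\hat{h}(u)\,\frac{1}{1 - r\bar{f}(u)}\,du,$$

where $\bar{f}(u)$ denotes the complex conjugate of $f(u)$. Then

$$\int h(x)\mu_y(dx) = 2 \lim_{r \uparrow 1} \int e^{-iuy}\,\hat{h}(u)\,\mathrm{Rl}\left(\frac{1}{1 - rf(u)}\right) du. \tag{10.10}$$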


The basic result used now is a lemma due to Feller and Orey [60].


Lemma 10.11. Let $f(u)$ be a characteristic function such that $f(u) \ne 1$ for
$0 < |u| \le b$. Then on $|u| \le b$, the measure with density $\mathrm{Rl}(1/(1 - rf(u)))$
converges weakly as $r \uparrow 1$ to the measure with density $\mathrm{Rl}(1/(1 - f(u)))$, plus a
point mass of amount $\pi/EX_1$ at $u = 0$. Also, the integral of $|\mathrm{Rl}(1/(1 - f(u)))|$
is finite on $|u| \le b$.


This lemma is really the core of the proof. Accept it for the moment. In
the nonlattice case, $f(u) \ne 1$ for every $u \ne 0$. Thus, the limit on the right-
hand side of (10.10) exists and

$$\int h(x)\mu_y(dx) = \frac{2\pi\,\hat{h}(0)}{EX_1} + 2\int e^{-iuy}\,\hat{h}(u)\,\mathrm{Rl}\left(\frac{1}{1 - f(u)}\right) du.$$

Apply the Riemann-Lebesgue lemma to the integral on the right, getting

$$\lim_y \int h(x)\mu_y(dx) = \frac{2\pi}{EX_1}\,\hat{h}(0).$$

The inversion formula (8.40) yields

$$\frac{2\pi}{EX_1}\,\hat{h}(0) = \frac{1}{EX_1} \int h(x)\,dx,$$


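which proves the theorem in the nonlattice case. In the lattice case, say
$d = 1$, put

$$g(u) = \sum_{k=-\infty}^{+\infty} \hat{h}(u + 2\pi k).$$

Since $y \in L_1$, use the notation $\mu_n$ instead of $\mu_y$. Then (10.10) becomes,
since $f(u)$ is periodic,

$$\int h(x)\mu_n(dx) = 2 \lim_{r \uparrow 1} \int_{-\pi}^{+\pi} e^{-iun} g(u)\,\mathrm{Rl}\left(\frac{1}{1 - rf(u)}\right) du.$$

Furthermore, since $L_1$ is the minimal lattice for $X_1$, $f(u) \ne 1$ on $0 < |u| \le \pi$.
Apply lemma 10.11 again:

$$\int h(x)\mu_n(dx) = \frac{2\pi g(0)}{EX_1} + 2\int_{-\pi}^{+\pi} e^{-iun} g(u)\,\mathrm{Rl}\left(\frac{1}{1 - f(u)}\right) du.$$

The Riemann-Lebesgue lemma gives

$$\lim_n \int h(x)\mu_n(dx) = \frac{2\pi g(0)}{EX_1}.$$

Now look:

$$h(m) = \int e^{ium}\,\hat{h}(u)\,du = \int_{-\pi}^{+\pi} e^{ium} g(u)\,du.$$

By the inversion formula for Fourier series, taking $\hat{h}(u)$ so that $\sum |h(m)| < \infty$,

$$g(u) = \frac{1}{2\pi} \sum_{m=-\infty}^{+\infty} e^{-ium} h(m),$$

$$g(0) = \frac{1}{2\pi} \int h(x)\,l_1(dx),$$

finishing the proof in the lattice case.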


Now for the proof of the lemma. Let $\varphi(u)$ be any real-valued continuous
function vanishing for $|u| > b$, such that $\sup |\varphi| \le 1$. Consider


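$$\int \varphi(u)\,\mathrm{Rl}\left[\frac{1}{1 - rf(u)} - \frac{1}{1 - f(u)}\right] du
= -(1 - r) \int \varphi(u)\,\mathrm{Rl}\left[\frac{f(u)}{(1 - rf(u))(1 - f(u))}\right] du.$$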
For any $\varepsilon > 0$, the integrand converges boundedly to zero on $(-\varepsilon, +\varepsilon)^c$ as
$r \uparrow 1$. Consider the integral over $(-\varepsilon, +\varepsilon)$. The term in brackets equals


fm — f) HON 
10.12 JWA TIW) L a pe 
( ) 1 — f(u) P or — f(u) 


Using $|1 - rf| \ge 1 - r$, we find that the integral above containing the
second term of (10.12) over $(-\varepsilon, +\varepsilon)$ is dominated by

$$\int_{-\varepsilon}^{+\varepsilon} \left|\mathrm{Rl}\left(\frac{1}{1 - f}\right)\right| du.$$

Assuming that $|\mathrm{Rl}(1/(1 - f))|$ has a finite integral, we can set this term
arbitrarily small by selection of $\varepsilon$. The function $(1 - \bar{f})/(1 - f)$ is con-
tinuous for $0 < |u| \le b$. Use the Taylor expansion,

$$f(u) = 1 + iu(EX + \delta(u)),$$

where $\delta(u) \to 0$ as $u \to 0$, to conclude that the limit of $(1 - \bar{f})/(1 - f)$
exists equal to $-1$ as $u \to 0$. Define its value at zero to be $-1$, making
$(1 - \bar{f})/(1 - f)$ a continuous function on $|u| \le b$. Then the integral
containing the first term of (10.12) is given by

$$\lim_{r \uparrow 1} \int_{-\varepsilon}^{+\varepsilon} \frac{1 - r}{|1 - rf(u)|^2}\,g(u)\,du, \tag{10.13}$$

where $g(u)$ is a continuous function such that $g(0) = \varphi(0)$. Denote $m = EX$.
Use the Taylor expansion again to see that for any $\Delta > 0$, $\varepsilon$ can be selected
so that for $|u| \le \varepsilon$

$$(1 - \Delta)\big((1 - r)^2 + u^2m^2\big) \le |1 - rf|^2 \le (1 + \Delta)\big((1 - r)^2 + u^2m^2\big).$$

Combine this with

$$\lim_{r \uparrow 1} \int_{-\varepsilon}^{+\varepsilon} \frac{1 - r}{(1 - r)^2 + u^2m^2}\,du
= \lim_{r \uparrow 1} \int_{|v| \le \varepsilon/(1 - r)} \frac{1}{1 + v^2m^2}\,dv
= \int_{-\infty}^{+\infty} \frac{1}{1 + v^2m^2}\,dv = \frac{\pi}{m}$$

to conclude that the limit of (10.13) is $\pi\varphi(0)/m$. The last fact we need now
is the integrability of

$$\mathrm{Rl}\left(\frac{1}{1 - f(u)}\right) = \frac{\mathrm{Rl}(1 - f(u))}{|1 - f(u)|^2}$$

on $|u| \le b$. Since $|1 - f(u)|^2 \ge m^2u^2/2$ in some neighborhood of the origin,
it is sufficient to show that $\mathrm{Rl}(1 - f(u))/u^2$ has a finite integral. Write

$$\mathrm{Rl}(f(u) - 1) = \mathrm{Rl} \int (e^{iux} - 1 - i \sin ux)\,dF(x),$$

so that

$$\int_{-b}^{+b} \frac{1}{u^2}\,\mathrm{Rl}(1 - f(u))\,du \le \iint \frac{1}{u^2}\,|e^{iux} - 1 - i \sin ux|\,dF(x)\,du \le J \int |x|\,dF(x),$$

where the orders of integration have been interchanged, and

$$J = \int \frac{1}{u^2x^2}\,|e^{iux} - 1 - i \sin ux|\,|x|\,du = \int \frac{1}{v^2}\,|e^{iv} - 1 - i \sin v|\,dv < \infty.$$
v 


4. A LOCAL CENTRAL LIMIT THEOREM 


For fair coin-tossing, a combinatorial argument gives $P(S_{2n} = 0) \sim 1/\sqrt{\pi n}$,
a result that has been useful. What if the $X_1, X_2, \ldots$ are integer-valued, or
distributed on the lattice $L_d$? More generally, what about estimating
$P(S_n \in I)$? Look at this optimistic argument:

$$P(S_n < \sigma\sqrt{n}\,x) \to \Phi(x) \tag{10.14}$$

if $EX_1 = 0$, $EX_1^2 = \sigma^2 < \infty$. By Problem 4, Chapter 8, the supremum of the
difference in (10.14) goes to zero. Thus by substituting $y = \sigma\sqrt{n}\,x$,

$$P(y_1 < S_n < y_2) - \frac{1}{\sqrt{2\pi}} \int_{x_1}^{x_2} e^{-t^2/2}\,dt \to 0,$$

where $x_1 = y_1/\sigma\sqrt{n}$, $x_2 = y_2/\sigma\sqrt{n}$. Substitute in the integral, and rewrite as

$$P(y_1 < S_n < y_2) - \frac{1}{\sigma\sqrt{2\pi n}} \int_{y_1}^{y_2} e^{-\xi^2/2\sigma^2 n}\,d\xi \to 0. \tag{10.15}$$

If the convergence is so rapid that the difference in (10.15) is $o(1/\sqrt{n})$, then
multiplying by $\sigma\sqrt{2\pi n}$ gets us

$$\sigma\sqrt{2\pi n}\,P(y_1 < S_n < y_2) - \int_{y_1}^{y_2} e^{-\xi^2/2\sigma^2 n}\,d\xi \to 0.$$

The integrand goes uniformly to one, so rewrite this as

$$\sigma\sqrt{2\pi n}\,P(y_1 < S_n < y_2) \to y_2 - y_1.$$

This gives the surprising conclusion that estimates like $P(S_{2n} = 0) \sim 1/\sqrt{\pi n}$
may not be peculiar to fair coin-tossing, but may hold for all sums of in-
dependent, identically distributed random variables with zero means and
finite variance. The fact that this delicate result is true for most distributions
is a special case of the local central limit theorem. It does not hold generally:
look at coin-tossing again, where $P(S_n = 0) = 0$ for $n$ odd. The next definition
restricts attention to those $X_1, X_2, \ldots$ such that the sums do not have an
unpleasant periodicity.


Definition 10.16. A random variable $X$ is called a centered lattice random
variable if there exists $d > 0$ such that $P(X \in L_d) = 1$, and there is no $d' > d$
and $\alpha$ such that $P(X \in \alpha + L_{d'}) = 1$. $X$ is called centered nonlattice if there
are no numbers $\alpha$ and $d > 0$ such that $P(X \in \alpha + L_d) = 1$.


For example, a random variable $X$ with $P(X = 1) = P(X = 3) = \tfrac12$ is
not centered lattice, because $L_1$ is the minimal lattice with $P(X \in L_1) = 1$, but
$P(X \in 1 + L_2) = 1$.

As before, let $l_d$ assign mass $d$ to every point of $L_d$, and let $l_0$ denote Lebesgue
measure on $L_0$.


Theorem 10.17. Let $X_1, X_2, \ldots$ be independent, identically distributed random
variables, either centered lattice on $L_d$ or centered nonlattice on $L_0$, with
$EX_1^2 = \sigma^2 < \infty$. Then for any finite interval $I$,

$$\sigma\sqrt{2\pi n}\,P(S_n \in I) \to l_d(I). \tag{10.18}$$
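
The statement can be tested on a small example by exact convolution. The sketch below is an illustration under assumptions made here: the step distribution $P(X = \pm 1) = \tfrac14$, $P(X = 0) = \tfrac12$ (mean zero, centered lattice on $L_1$, $\sigma^2 = \tfrac12$) and the name `prob_zero_lazy_walk` are not from the text. With $I = \{0\}$ the predicted limit is $l_1(I) = 1$.

```python
import math

def prob_zero_lazy_walk(n):
    """P(S_n = 0) for i.i.d. steps with P(X=-1)=1/4, P(X=0)=1/2, P(X=1)=1/4."""
    dist = {0: 1.0}  # distribution of S_0
    for _ in range(n):
        nxt = {}
        for k, p in dist.items():
            nxt[k - 1] = nxt.get(k - 1, 0.0) + 0.25 * p
            nxt[k] = nxt.get(k, 0.0) + 0.5 * p
            nxt[k + 1] = nxt.get(k + 1, 0.0) + 0.25 * p
        dist = nxt
    return dist[0]

sigma = math.sqrt(0.5)
for n in (10, 100, 400):
    # Left-hand side of (10.18) for I = {0}.
    print(n, sigma * math.sqrt(2.0 * math.pi * n) * prob_zero_lazy_walk(n))
```

Unlike fair coin-tossing, this walk can stand still, so there is no parity obstruction and the normalized probabilities converge along all $n$.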


Proof. In stages. First of all, if $X$ is centered nonlattice, then by Problem 21,
Chapter 8, $|f(u)| < 1$, $u \ne 0$. If $X$ is centered lattice on $L_d$, then $f(u)$ has
period $2\pi/d$ and the only points at which $|f(u)| = 1$ are $\{2\pi k/d\}$, $k = 0,
\pm 1, \ldots$ Eq. (10.18) is equivalent to the assertion that the measures $\mu_n$ defined
on $\mathcal{B}_1$ by

$$\mu_n(B) = \sigma\sqrt{2\pi n}\,P(S_n \in B)$$

converge weakly to the measure $l_d$. The plan of this proof is to show that for
every $h(x) \in \mathcal{H}$

$$\sigma\sqrt{2\pi n}\,Eh(S_n) \to \int h(x)\,l_d(dx), \tag{10.19}$$

and then to apply Theorem 10.7.


Now to prove (10.19): Suppose first that $|f(u)| \ne 1$ on $J - \{0\}$, $J$
some finite closed interval, and that $\hat{h}(u)$ vanishes on $J^c$. Write

$$Eh(S_n) = \iint e^{iux}\,\hat{h}(u)\,du\,dF_n(x),$$

where $F_n$ is the distribution function of $S_n$. Then

$$Eh(S_n) = \int [f(u)]^n\,\hat{h}(u)\,du.$$
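From 8.44, $f(u) = 1 - (\sigma^2u^2/2)(1 + \delta(u))$, where $\delta(u) \to 0$ as $u \to 0$.
Take $N = (-b, +b)$, so small that on $N$, $|\delta(u)| < \tfrac12$, $\sigma^2u^2 < 1$. On $J - N$,
$|f(u)| \le 1 - \beta$, $0 < \beta < 1$. Letting $\|\hat{h}\| = \sup |\hat{h}(u)|$, we get

$$Eh(S_n) = \int_N [f(u)]^n\,\hat{h}(u)\,du + \theta_n \|\hat{h}\| (1 - \beta)^n, \qquad |\theta_n| \le \|J\|. \tag{10.20}$$

On $N$,

$$|f(u)|^n \le \left(1 - \frac{\sigma^2u^2}{4}\right)^n \le e^{-n\sigma^2u^2/4}. \tag{10.21}$$

By the substitution $u = v/\sqrt{n}$,

$$\sqrt{n} \int_N [f(u)]^n\,\hat{h}(u)\,du = \int_{(-b\sqrt{n},\,+b\sqrt{n})} [f(v/\sqrt{n})]^n\,\hat{h}(v/\sqrt{n})\,dv.$$

By (10.21) the integrand on the right is dominated for all $v$ by the integrable
function

$$\|\hat{h}\|\,e^{-\sigma^2v^2/4}.$$

But just as in the central limit theorem, $[f(v/\sqrt{n})]^n \to e^{-\sigma^2v^2/2}$. Since
$\hat{h}(v/\sqrt{n}) \to \hat{h}(0)$, use the dominated convergence theorem for

$$\sqrt{n} \int_N [f(u)]^n\,\hat{h}(u)\,du \to \hat{h}(0) \int e^{-\sigma^2v^2/2}\,dv = \hat{h}(0)\,\frac{\sqrt{2\pi}}{\sigma}.$$

Use (10.20) to get

$$\sqrt{n}\,Eh(S_n) \to \frac{\sqrt{2\pi}}{\sigma}\,\hat{h}(0). \tag{10.22}$$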
o 
When $X$ is centered nonlattice, this holds for all finite $J$. Furthermore,
the Fourier inversion theorem gives

$$\hat{h}(u) = \frac{1}{2\pi} \int e^{-iux} h(x)\,dx.$$

By putting $u = 0$ we can prove the assertion. In the lattice case, assume
$d = 1$ for convenience, so $f(u)$ has period $2\pi$. Let $g(u) = \sum_{k=-\infty}^{+\infty} \hat{h}(u + 2\pi k)$ so
that

$$Eh(S_n) = \int_{-\pi}^{+\pi} [f(u)]^n g(u)\,du.$$

The purpose of this is that now $|f(u)| \ne 1$ on $[-\pi, +\pi] - \{0\}$, so (10.22)
holds in the form

$$\sqrt{n}\,Eh(S_n) \to \frac{\sqrt{2\pi}}{\sigma}\,g(0).$$

Just as in the proof of the renewal theorem,

$$g(0) = \frac{1}{2\pi} \int h(x)\,l_1(dx),$$

which proves the theorem in the lattice case. Q.E.D.


Problem 1. Under the conditions of this section, show that

$$\sigma\sqrt{2\pi n}\,P(S_n \in I + x) \to l_d(I)$$

uniformly for $x$ in bounded subsets of $L_d$. [See Problem 4, Chapter 8.]


5. APPLYING A TAUBERIAN THEOREM 


Let $X_1, X_2, \ldots$ be centered lattice. Suppose we want to get information con-
cerning the distribution of $T$, the time of the first zero of the sums $S_1, S_2, \ldots$
From (7.42), for all $n \ge 0$,

$$1 = \sum_{m=0}^{n} P(S_{n-m} = 0)\,P(T > m)$$

with the convention $S_0 = 0$. Multiply this equation by $r^n$, $0 \le r < 1$, and
sum from $n = 0$ to $\infty$,

$$\frac{1}{1 - r} = \left(\sum_{n=0}^{\infty} r^n P(S_n = 0)\right)\left(\sum_{n=0}^{\infty} r^n P(T > n)\right).$$

The local limit theorem gives

$$P(S_n = 0) \sim \frac{d}{\sigma\sqrt{2\pi n}},$$

so as $r \uparrow 1$,

$$P(r) = \sum_{n=0}^{\infty} r^n P(S_n = 0)$$

blows up. Suppose we can use the asymptotic expression for $P(S_n = 0)$ to
get the rate at which $P(r)$ blows up. Since

$$T(r) = \sum_{n=0}^{\infty} r^n P(T > n)$$

is given by

$$T(r) = \frac{1}{(1 - r)P(r)},$$

we have information regarding the rate at which $T(r) \to \infty$ when $r \to 1$.
Now we would like to reverse direction and get information about $P(T > n)$
from the rate at which $T(r)$ blows up. The first direction is called an Abelian
argument; the reverse and much more difficult direction is called a Tauberian
argument.
To get some feeling for what is going on, consider

$$g(r) = \sum_{n=1}^{\infty} n^\alpha r^n, \qquad \alpha > -1.$$

Put $r = e^{-s}$. For $s$ small, a good approximation to $g(r)$ is given by the
integral

$$\int_0^\infty e^{-sx} x^\alpha\,dx = \frac{1}{s^{\alpha+1}} \int_0^\infty e^{-x} x^\alpha\,dx.$$

The integral is the gamma function $\Gamma(\alpha + 1)$. Since $s \sim 1 - r$ as $r \uparrow 1$,
this can be made to provide a rigorous proof of

$$g(r) \sim \frac{\Gamma(\alpha + 1)}{(1 - r)^{\alpha+1}}, \qquad \text{as } r \uparrow 1. \tag{10.23}$$

Now suppose (10.23) holds for $g(r) = \sum_0^\infty a_n r^n$. Can we reverse and conclude
that $a_n \sim n^\alpha$? Not quite; what is true is the well-known theorem:


Theorem 10.24. Let $g(r) = \sum_0^\infty a_n r^n$, $a_n \ge 0$, $n = 0, \ldots$ Then as $n \to \infty$

$$a_1 + \cdots + a_n \sim cn^\alpha, \qquad \alpha > 0,$$

if and only if, as $r \uparrow 1$,

$$g(r) \sim c(1 - r)^{-\alpha}\,\Gamma(\alpha + 1).$$


For a nice proof of this theorem, see Feller [59, Vol. II, p. 423]. The implica-
tion from the behavior of $g(r)$ to that of $a_1 + \cdots + a_n$ is the hard part and is
a special case of Karamata's Tauberian theorem [85].
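
The easy (Abelian) direction, in the form (10.23), is simple to check numerically. The sketch below is illustrative only: the function name `abel_ratio`, the truncation tolerance, and the choice $\alpha = \tfrac12$ are assumptions made here. It sums $g(r) = \sum_{n \ge 1} n^\alpha r^n$ until the weights $r^n$ are negligible and reports $g(r)(1 - r)^{\alpha+1}/\Gamma(\alpha + 1)$.

```python
import math

def abel_ratio(alpha, r, tol=1e-18):
    """Return g(r) * (1 - r)**(alpha + 1) / Gamma(alpha + 1).

    Here g(r) = sum over n >= 1 of n**alpha * r**n, truncated once r**n <= tol.
    """
    total, n, weight = 0.0, 1, r
    while weight > tol:
        total += n ** alpha * weight
        n += 1
        weight *= r
    return total * (1.0 - r) ** (alpha + 1) / math.gamma(alpha + 1)

for r in (0.9, 0.99, 0.999):
    print(r, abel_ratio(0.5, r))
```

The ratio moves toward 1 as $r \uparrow 1$. Nothing in this computation touches the Tauberian direction, which is where the nonnegativity of the $a_n$ is needed.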


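We use the easy part of this theorem to show that

$$P(r) \sim (1 - r)^{-1/2}\,\frac{d}{\sigma\sqrt{2}}.$$

This follows immediately from the local limit theorem, the fact that $\Gamma(\tfrac12) =
\sqrt{\pi}$, and

$$\sum_{k=1}^{n} \frac{1}{\sqrt{k}} \sim 2\sqrt{n}.$$

[Approximate the sums above and below by integrals of $1/\sqrt{x}$ and $1/\sqrt{x + 1}$
respectively.]
Of course, this gives

$$T(r) \sim \frac{\sigma\sqrt{2}}{d}\,(1 - r)^{-1/2} \qquad \text{as } r \uparrow 1.$$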
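Theorem 10.25

$$P(T > n) \sim \frac{\sigma\sqrt{2}}{d\sqrt{\pi n}}.$$


Proof. By Karamata's Tauberian theorem, putting $p_n = P(T > n)$, $c =
2^{3/2}\sigma/d\sqrt{\pi}$, we get

$$p_0 + \cdots + p_n \sim c\sqrt{n}.$$

Since the $p_n$ are nonincreasing, for any $m < n$ write

$$(n - m)p_n \le p_{m+1} + \cdots + p_n \le (n - m)p_m.$$

Divide by $\sqrt{n}$, put $m = [\lambda n]$, $0 < \lambda < 1$, and let $n \to \infty$ to get

$$(1 - \lambda) \limsup_n \sqrt{n}\,p_n \le c(1 - \sqrt{\lambda}) \le \frac{1 - \lambda}{\sqrt{\lambda}} \liminf_m \sqrt{m}\,p_m.$$

Let $\lambda \to 1$; then $(1 - \sqrt{\lambda})/(1 - \lambda) \to \tfrac12$, so $\sqrt{n}\,p_n \to c/2$.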
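
The conclusion $\sqrt{n}\,p_n \to c/2 = \sqrt{2}\,\sigma/(d\sqrt{\pi})$ can be checked on one example by exact computation. The sketch below is an illustration under assumptions made here: the walk with $P(X = \pm 1) = \tfrac14$, $P(X = 0) = \tfrac12$ (so $d = 1$, $\sigma^2 = \tfrac12$, and $c/2 = 1/\sqrt{\pi}$) and the name `tail_first_zero` are not from the text.

```python
import math

def tail_first_zero(n_max):
    """[P(T > n) for n = 0..n_max], where T is the first n >= 1 with S_n = 0.

    Steps are -1, 0, +1 with probabilities 1/4, 1/2, 1/4.
    """
    tails = [1.0]
    # alive[k] = P(S_1 > 0, ..., S_n > 0, S_n = k); the negative side is symmetric.
    alive = {1: 0.25}
    for n in range(1, n_max + 1):
        tails.append(2.0 * sum(alive.values()))
        nxt = {}
        for k, p in alive.items():
            nxt[k] = nxt.get(k, 0.0) + 0.5 * p
            nxt[k + 1] = nxt.get(k + 1, 0.0) + 0.25 * p
            if k > 1:  # stepping from 1 down to 0 would be the first zero
                nxt[k - 1] = nxt.get(k - 1, 0.0) + 0.25 * p
        alive = nxt
    return tails

tails = tail_first_zero(300)
for n in (10, 100, 300):
    print(n, math.sqrt(math.pi * n) * tails[n])
```

The printed values approach 1, matching $\sqrt{n}\,P(T > n) \to 1/\sqrt{\pi}$ for this walk.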

Problem 2. Let $N_n(0)$ be the number of times that the sums $S_1, \ldots, S_n$ visit
the state zero. Use the theory of stable laws and 10.25 to find constants
$A_n \uparrow \infty$ such that $N_n(0)/A_n \xrightarrow{\mathcal{D}} X$ where $X$ is nondegenerate.


6. OCCUPATION TIMES 


One neat application of the local central limit theorem is to the problem:
Given a finite interval $I$, let $S_1, S_2, \ldots$ be sums of independent, identically distributed
random variables $X_1, X_2, \ldots$ such that $EX_1 = 0$, $EX_1^2 = \sigma^2 < \infty$, and take
$S_0 = 0$. Let $N_n$ be the number of visits of $S_0, S_1, \ldots, S_n$ to the interval $I$. Is
there a normalization such that $N_n/A_n \xrightarrow{\mathcal{D}} X$, $X$ nondegenerate? $N_n$ is the
amount of time in the first $n$ trials that the sums spend in the interval $I$, hence
the name "occupation-time problem." Let $X_1, X_2, \ldots$ be centered lattice or
nonlattice, so the local limit theorem is in force. To get the size of $A_n$, compute

$$EN_n = \sum_{j=0}^{n} E\chi_I(S_j) \sim \sum_{j=1}^{n} \frac{l_d(I)}{\sigma\sqrt{2\pi j}} \sim c\sqrt{n}.$$

Hence we take $A_n = \sqrt{n}$. The way from here is the method of moments. Let

$$\mu_k(n) = E(N_n)^k = \sum_{n \ge j_1, \ldots, j_k \ge 0} E\big(\chi_I(S_{j_1}) \cdots \chi_I(S_{j_k})\big).$$

By combining all permutations of the same indices $j_1, \ldots, j_k$ we get

$$\mu_k(n) = k! \sum_{n \ge j_k \ge \cdots \ge j_1 \ge 0} E\big(\chi_I(S_{j_1}) \cdots \chi_I(S_{j_k})\big)$$

and

$$\mu_k(n) - \mu_k(n - 1) = k! \sum_{n = j_k \ge \cdots \ge j_1 \ge 0} E\big(\chi_I(S_{j_1}) \cdots \chi_I(S_{j_k})\big).$$
Define transition probabilities on $\mathcal{B}_1$ by

$$p^{(n)}(B \mid x) = P(S_{n+1} \in B \mid S_1 = x) = P(S_n \in B - x).$$

So

$$\mu_k(n) - \mu_k(n - 1) = k! \sum_{i_1 + \cdots + i_k = n} \int_I \cdots \int_I \prod_{m=1}^{k} p^{(i_m)}(dx_m \mid x_{m-1}),$$

where $x_0 = 0$. If

$$p_r(B \mid x) = \sum_{n=0}^{\infty} r^n p^{(n)}(B \mid x),$$

where $p^{(0)}(B \mid x)$ is the point mass concentrated on $\{x\}$, and $0 \le r < 1$,
then, defining $\mu_k(-1) = 0$,

$$\sum_{n=0}^{\infty} r^n \big(\mu_k(n) - \mu_k(n - 1)\big) = k! \int_I \cdots \int_I \prod_{m=1}^{k} p_r(dx_m \mid x_{m-1}). \tag{10.26}$$

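Proposition 10.27

$$\sqrt{1 - r}\;p_r(I \mid x) \xrightarrow{\text{uc}} \frac{l_d(I)}{\sigma\sqrt{2}}$$

for $x \in L_d$ as $r \uparrow 1$.


Proof. By Problem 1 of this chapter

$$\sigma\sqrt{2\pi n}\;p^{(n)}(I \mid x) \xrightarrow{\text{uc}} l_d(I)$$

for $x \in L_d$. Hence for all $x \in J \cap L_d$, $J$ a finite interval, and for any $\varepsilon > 0$,
there exists $n_0$ such that for $n \ge n_0$,

$$(1 - \varepsilon)\,\frac{l_d(I)}{\sigma\sqrt{2\pi n}} \le p^{(n)}(I \mid x) \le (1 + \varepsilon)\,\frac{l_d(I)}{\sigma\sqrt{2\pi n}}.$$

By 10.24 we get uc convergence of $\sqrt{1 - r}\;p_r(I \mid x)$ to $l_d(I)/\sigma\sqrt{2}$.

Proposition 10.28

$$\lim_{r \uparrow 1}\,(1 - r)^{k/2} \sum_{n=0}^{\infty} r^n \big(\mu_k(n) - \mu_k(n - 1)\big) = k! \left(\frac{l_d(I)}{\sigma\sqrt{2}}\right)^k.$$

Proof. On the right-hand side of (10.26), look at the first factor of the
integrand multiplied by $\sqrt{1 - r}$, namely

$$\sqrt{1 - r}\;p_r(I \mid x_{k-1}).$$

This converges uniformly for $x_{k-1} \in I \cap L_d$ to a constant. Hence we can
pull it out of the integral. Now continue this way to get the result above.

With 10.28 in hand, apply Karamata's Tauberian theorem to conclude
that

$$\mu_k(n) \sim \frac{n^{k/2}}{\Gamma((k/2) + 1)}\,k! \left(\frac{l_d(I)}{\sigma\sqrt{2}}\right)^k. \tag{10.29}$$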
This almost completes the proof of


Theorem 10.30

$$\frac{N_n}{\sqrt{n}} \xrightarrow{\mathcal{D}} \frac{l_d(I)}{\sigma\sqrt{2}}\,X,$$

where

$$P(X < x) = \frac{1}{\sqrt{\pi}} \int_0^x e^{-t^2/4}\,dt.$$

Proof. By (10.29),

$$E\left(\frac{N_n}{\sqrt{n}}\right)^k \to \frac{k!}{\Gamma((k/2) + 1)} \left(\frac{l_d(I)}{\sigma\sqrt{2}}\right)^k.$$

Let $m_k = EX^k$. Use integration by parts to show that

$$m_{k+2} = 2(k + 1)m_k$$

and deduce from this that

$$\frac{k!}{\Gamma((k/2) + 1)} = EX^k.$$

Proposition 8.49 implies that the distribution is here uniquely determined
by the moments. Therefore Theorem 8.48 applies. Q.E.D.
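
The moment identity in this proof can be verified numerically. The sketch below assumes the limit density is $e^{-t^2/4}/\sqrt{\pi}$ on $t > 0$, which is the density consistent with the recursion $m_{k+2} = 2(k + 1)m_k$; both function names are hypothetical. The closed form for the moments comes from the substitution $t = 2\sqrt{s}$.

```python
import math

def limit_moment(k):
    """k! / Gamma(k/2 + 1), the constant appearing in (10.29)."""
    return math.factorial(k) / math.gamma(k / 2 + 1)

def density_moment(k):
    """Integral over t > 0 of t**k * exp(-t**2 / 4) / sqrt(pi).

    With t = 2*sqrt(s) the integral becomes 2**k * Gamma((k + 1)/2) / sqrt(pi).
    """
    return 2 ** k * math.gamma((k + 1) / 2) / math.sqrt(math.pi)

for k in range(0, 7):
    print(k, limit_moment(k), density_moment(k))
```

Agreement of the two columns is Legendre's duplication formula for the gamma function in disguise.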


NOTES 


The renewal theorem in the lattice case was stated and proved by Erdös,
Feller, and Pollard [49] in 1949. Chung later pointed out that in this case the
theorem follows from the convergence theorem for countable Markov chains
due to Kolmogorov (see Chapter 7, Section 7). The theorem was gradually
extended and in its present form was proved by Blackwell [6, 7]. New
proofs continue to appear. There is an interesting recent proof due to Feller,
see [59, Vol. II], which opens a new aspect of the theorem. The method of
proof we use is adapted from Feller and Orey [60, 1961]. Charles Stone [132]
by similar methods gets very accurate estimates for the rate of convergence
of $H(B + x)$ as $x \to \infty$. A good exposition of the state of renewal theory as
of 1958 is given by W. L. Smith [127]. One more result that is usually con-
sidered to be part of the renewal theorem concerns the case $EX = \infty$. What
is known here is that if the sums $S_0, S_1, \ldots$ are transient, then $H(I + y) \to 0$
as $y \to \pm\infty$ for all finite intervals $I$. Hence, in particular, if one of $EX^+$,
$EX^-$ is finite, this result holds (see Feller [59, Vol. II, pp. 368 ff.]).

Local limit theorems for lattice random variables have a long history.
The original proof of the central limit theorem for coin-tossing was gotten
by first estimating $P(S_n = j)$ and thus used a local limit theorem to prove the
tendency toward $\mathcal{N}(0, 1)$. More local limit theorems for the lattice case are
in Gnedenko and Kolmogorov [62]. Essentially the theorem given in the text is
due to Shepp [122]; the method of proof follows [133]. In its form in the
text, the central limit theorem is not a consequence of the local theorem.
But by very similar methods somewhat sharper results can be proved. For
example, in the centered nonlattice case, for any interval $I$, let

$$\varphi(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/2\sigma^2}.$$

Stone proves that

$$\sup_x \left|\sqrt{n}\,P(S_n \in I + x) - l_0(I)\,\varphi(x/\sqrt{n})\right| \to 0,$$

and the central limit theorem for centered nonlattice variables follows from
this.

The occupation time theorem 10.30 was proven for normally distributed 
random variables by Chung and Kac [19], and in general, by Kallianpur 
and Robbins [84]. Darling and Kac [23] generalized their results significantly 
and simplified the proof by adding the Tauberian argument. 

The occupation time problem for an infinite interval, say $(0, \infty)$, is
considerably different. Then $N_n$ becomes the number of positive sums among
$S_1, \ldots, S_n$. The appropriate normalizing factor is $n$, and the famous arc
sine theorem states

$$P\left(\frac{N_n}{n} < x\right) \to \frac{2}{\pi} \arcsin \sqrt{x}.$$

See Spitzer's book [130], Chapter 4, for a complete discussion in the lattice
case, or Feller [59, Vol. II, Chapter 12], for the general case.


CHAPTER 11 


MULTIDIMENSIONAL CENTRAL LIMIT 
THEOREM AND GAUSSIAN 
PROCESSES 


1. INTRODUCTION 
Suppose that the objects under study are a sequence $\mathbf{X}_n$ of vector-valued
variables $(X_n^{(1)}, \ldots, X_n^{(k)})$ where each $X_n^{(j)}$, $j = 1, \ldots, k$, is a random
variable. What is a reasonable meaning to attach to

$$\mathbf{X}_n \xrightarrow{\mathcal{D}} \mathbf{X},$$

where $\mathbf{X} = (X_1, \ldots, X_k)$? Intuitively, the meaning of convergence in
distribution is that the probability that $\mathbf{X}_n$ is in some set $B \in \mathcal{B}_k$ converges
to the probability that $\mathbf{X}$ is in $B$, that is,

$$P(\mathbf{X}_n \in B) \to P(\mathbf{X} \in B).$$

But when we attempt to make this hold for all $B \in \mathcal{B}_k$, difficulty is encountered
in that $\mathbf{X}$ may be degenerate in part. In the one-dimensional case, the one-
point sets to which the limit $X$ assigned positive probability gave trouble and
had to be excluded. In general, what can be done is to require that

$$P(\mathbf{X}_n \in B) \to P(\mathbf{X} \in B)$$

for all sets $B \in \mathcal{B}_k$ such that $P(\mathbf{X} \in \mathrm{bd}(B)) = 0$.

The definition we use is directed at the problem from a different but 
equivalent angle. Let & be the class of all continuous functions on R® 
vanishing off of compact sets. 


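Definition 11.1. The k-vectors $X_n$ converge in distribution (or in law) to X if
for every $f(x) \in \mathcal{E}_0$,

\[ Ef(X_n) \to Ef(X). \]

This is written as $X_n \xrightarrow{\mathcal{D}} X$ or $\mathcal{L}(X_n) \to \mathcal{L}(X)$. In terms of distribution functions,
$F_n \xrightarrow{\mathcal{D}} F$ if and only if

\[ \int f(x)\, F_n(dx) \to \int f(x)\, F(dx) \]

for all $f \in \mathcal{E}_0$.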
By considering continuous functions equal to one on some compact set and
vanishing on the complement of some slightly larger compact set, conclude
that if $F_n \xrightarrow{\mathcal{D}} F$ and $F_n \xrightarrow{\mathcal{D}} G$, then F = G. Define as in the one-dimensional
case:


Definition 11.2. Let $\mathcal{N}_k$ be the set of all distribution functions on $R^{(k)}$. A set
$\mathcal{L} \subset \mathcal{N}_k$ is mass-preserving if for any $\epsilon > 0$, there is a compact set A such that
$F(A^c) < \epsilon$, for all $F \in \mathcal{L}$.


If $F_n \xrightarrow{\mathcal{D}} F$, then $\{F_n\}$ is mass-preserving. From this, conclude that $F_n \xrightarrow{\mathcal{D}} F$
if and only if for every bounded continuous function f(x) on $R^{(k)}$,

\[ \int f(x)\, F_n(dx) \to \int f(x)\, F(dx). \]

For any rectangle S such that F(bd(S)) = 0, approximate $\chi_S$ above and below
by continuous functions to see that $F_n(S) \to F(S)$. Conversely, approximate
$\int f\, dF_n$ by Riemann sums over rectangles such that F(bd(S)) = 0 to conclude
that $F_n \xrightarrow{\mathcal{D}} F$ is equivalent to

\[ F_n(S) \to F(S), \quad \text{if } F(\mathrm{bd}(S)) = 0. \]


There are plenty of rectangles whose boundaries have probability zero,
because if $P(X \in S) = P(X_1 \in I_1, \ldots, X_k \in I_k)$, then

\[ P(X \in \mathrm{bd}(S)) \le \sum_{j=1}^{k} P(X_j \in \mathrm{bd}(I_j)). \]

By the same approximation as in 8.12, conclude that for f(x) bounded,
$\mathcal{B}_k$-measurable, and with its discontinuity set having F-measure zero,

\[ \int f\, dF_n \to \int f\, dF. \]

From this, it follows that if B is in $\mathcal{B}_k$ and F(bd(B)) = 0, then $F_n(B) \to F(B)$.


Problem 1. Is $F_n \xrightarrow{\mathcal{D}} F$ equivalent to requiring

\[ F_n(x_1, \ldots, x_k) \to F(x_1, \ldots, x_k) \]

at every continuity point $(x_1, \ldots, x_k)$ of F? Prove or disprove.


2. PROPERTIES OF $\mathcal{N}_k$


The properties of k-dimensional probability measures are very similar to 
those of one-dimensional probabilities. The results are straightforward 
generalizations, and we deal with them sketchily. The major result is the 
generalization of the Helly-Bray theorem. 


Theorem 11.3. Let $\{F_n\} \subset \mathcal{N}_k$ be mass-preserving. Then there exists a
subsequence $F_{n_m}$ converging in distribution to some $F \in \mathcal{N}_k$.

Proof. Here is a slightly different proof that opens the way for generalizations.
Take $\{f_j\} \subset \mathcal{E}_0$ to be dense in $\mathcal{E}_0$ in the sense of uniform convergence.
To verify that a countable set can be gotten with this property, look at the
set of all polynomials with rational coefficients, then consider the set gotten
by multiplying each of these by a function $h_N$ which is one inside the
k-sphere of radius N and zero outside the sphere of radius N + 1. Use
diagonalization again; for every j, let $I_j$ be an ordered subset of the positive
integers such that $\int f_j\, dF_n$ converges as n runs through $I_j$, and $I_{j+1} \subset I_j$.
Let $n_m$ be the mth member of $I_m$; then the limit of $\int f_j\, dF_{n_m}$ exists for all j.
Take $f \in \mathcal{E}_0$, $|f - f_j| < \epsilon$, so that

\[ \left| \int f\, dF_{n_m} - \int f_j\, dF_{n_m} \right| < \epsilon, \]

hence $\lim \int f\, dF_{n_m}$ exists for all $f \in \mathcal{E}_0$. Because $\{F_n\}$ is mass-preserving,
$\lim \int f\, dF_{n_m} = J(f)$ exists for all bounded continuous f. Denote the open
rectangle $\{(x_1, \ldots, x_k);\ x_1 < y_1, \ldots, x_k < y_k\}$ by S(y). Take $g_n$ bounded
and continuous such that $g_n \uparrow \chi_{S(y)}$. Then $J(g_n)$ is nondecreasing; call the
limit $F(y_1, \ldots, y_k)$. The rest of the proof is the simple verification that
$F(y_1, \ldots, y_k)$ is a distribution function, and that if F(bd(S(y))) = 0, then
$F_{n_m}(S(y)) \to F(S(y))$.

Define sets $\mathcal{E}$ of $\mathcal{N}_k$-separating functions as in 8.14, prove as in 8.15 and
8.16 that $\{F_n\}$ mass-preserving, $\int f\, dF_n$ convergent, all $f \in \mathcal{E}$, imply the existence
of an $F \in \mathcal{N}_k$ such that $F_n \xrightarrow{\mathcal{D}} F$. Obviously, one sufficient condition for
a class of functions to be $\mathcal{N}_k$-separating is that they be dense in $\mathcal{E}_0$ under
uc convergence of uniformly-bounded sequences of functions.


Theorem 11.4. The set of complex exponentials of the form

\[ \{\exp[i(u_1 x_1 + \cdots + u_k x_k)]\}, \quad u \in R^{(k)}, \]

is $\mathcal{N}_k$-separating.


Proof. The point here is that for any $f \in \mathcal{E}_0$, we can take n so large that
f(x) = 0 on the complement of $S_n = \{x;\ x_j \in [-n, +n],\ j = 1, \ldots, k\}$.
Now approximate f(x) uniformly on $S_n$ by sums of terms of the form
$\exp[\pi i(m_1 x_1 + \cdots + m_k x_k)/n]$, $m_1, \ldots, m_k$ integers. The rest goes through
as in Theorem 8.24.


Definition 11.5. For $u, x \in R^{(k)}$, let $(u, x) = \sum_1^k u_j x_j$, and define the
characteristic function of the k-vector $X = (X_1, \ldots, X_k)$ as $f_X(u) = Ee^{i(u, X)}$ or of
the distribution function F(x) as

\[ f(u) = \int e^{i(u, x)}\, F(dx). \]


The continuity theorem holds. 
Theorem 11.6. For $F_n \in \mathcal{N}_k$ having characteristic functions $f_n(u)$, if

a) $\lim_n f_n(u)$ exists for every $u \in R^{(k)}$,

b) $\lim_n f_n(u) = h(u)$ is continuous at the origin,

then there is a distribution function $F \in \mathcal{N}_k$ such that $F_n \xrightarrow{\mathcal{D}} F$ and h(u) is the
characteristic function of F.


Proof. The only question at all is the analog of inequality 8.29. But this is
simple. Observe that


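\[ \frac{1}{u^k} \int \Big[ \int_0^u \cdots \int_0^u \prod_{j=1}^{k} (1 - \cos v_j x_j)\, dv \Big] F(dx)
   = \int \prod_{j=1}^{k} \Big( 1 - \frac{\sin u x_j}{u x_j} \Big) F(dx) \ge \frac{1}{\alpha} F(S_u), \]

where $S_u = \{|x_1| > 1/u, \ldots, |x_k| > 1/u\}$. For any function $g(v_1, \ldots, v_k)$
on $R^{(k)}$ define $T_j g$ to be the function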

\[ g(v_1, \ldots, v_{j-1}, 0, v_{j+1}, \ldots, v_k) - \tfrac12 g(v_1, \ldots, v_j, \ldots, v_k) - \tfrac12 g(v_1, \ldots, -v_j, \ldots, v_k). \]

Then

\[ \int \prod_{j=1}^{k} (1 - \cos v_j x_j)\, F(dx) = T_k \cdots T_1 f(v_1, \ldots, v_k), \]

where f is the characteristic function of F. The function $T_k \cdots T_1 f(v)$ is
continuous and is zero at the origin. Write the inequality as


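\[ F(S_u) \le \frac{\alpha}{u^k} \int_0^u \cdots \int_0^u T_k \cdots T_1 f(v)\, dv. \]

Now, $f_n(u) \to h(u)$ implies

\[ \limsup_n F_n(S_u) \le \frac{\alpha}{u^k} \int_0^u \cdots \int_0^u T_k \cdots T_1 h(v)\, dv, \]

and everything goes through.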


Problems 


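2. If $X_1^{(n)}, \ldots, X_k^{(n)}$ are independent for every n, and if for j fixed,

\[ X_j^{(n)} \xrightarrow{\mathcal{D}} X_j, \]

prove that

\[ (X_1^{(n)}, \ldots, X_k^{(n)}) \xrightarrow{\mathcal{D}} (X_1, \ldots, X_k) \]

and $X_1, \ldots, X_k$ are independent.

3. Let $X^{(n)} \xrightarrow{\mathcal{D}} X$ and $\varphi_1, \ldots, \varphi_m$ be continuous functions on $R^{(k)}$.
Show that

\[ (\varphi_1(X^{(n)}), \ldots, \varphi_m(X^{(n)})) \xrightarrow{\mathcal{D}} (\varphi_1(X), \ldots, \varphi_m(X)). \]

4. Show that the conclusion of Problem 3 remains true if "$\varphi_k$, $k = 1, \ldots, m$,
continuous" is replaced by "$\varphi_k(x)$, $k = 1, \ldots, m$, a.s. continuous with respect
to the distribution on $R^{(k)}$ given by $X = (X_1, \ldots, X_k)$."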
To use the continuity theorem, estimates on f(u) are needed. Write

\[ \|x\|^2 = \sum_{j=1}^{k} x_j^2. \]


Proposition 11.7. Let $E\|X\|^n < \infty$; then

\[ f(u) = \sum_{m=0}^{n} \frac{i^m}{m!} E[(u, X)^m] + \delta(u) \|u\|^n, \]

where $\delta(u) \to 0$ as $u \to 0$.


Proof. Write

\[ e^{i(u, x)} = \sum_{m=0}^{n} \frac{i^m}{m!} (u, x)^m + \frac{i^n (u, x)^n}{n!}\, \varphi(u, x), \]

where

\[ \varphi(u, x) = \cos[\theta_1 (u, x)] + i \sin[\theta_2 (u, x)] - 1, \]

$\theta_1, \theta_2$ real and $|\theta_1| \le 1$, $|\theta_2| \le 1$. By the Schwarz inequality we have
$|(u, x)| \le \|u\| \|x\|$. Thus

\[ |E[(u, X)^n \varphi(u, X)]| \le \|u\|^n E(\|X\|^n |\varphi(u, X)|). \]

The integrand is dominated by the integrable function $3\|X\|^n$, and $\varphi(u, X) \to 0$
as $u \to 0$ for all $\omega$. Apply the bounded convergence theorem to get the
result.


Definition 11.8. Given a k-vector $(X_1, \ldots, X_k)$, $EX_j = 0$, $j = 1, \ldots, k$,
define the k x k covariance matrix $\Gamma$ by

\[ \Gamma_{ij} = EX_i X_j, \quad i, j = 1, \ldots, k. \]


Definition 11.9. The vector X, $EX_j = 0$, $j = 1, \ldots, k$, is said to have a joint
normal distribution $N(0, \Gamma)$ if

\[ f_X(u) = \exp\Big[ -\tfrac12 \sum_{i,j} \Gamma_{ij} u_i u_j \Big]. \]
Theorem 11.10. Let $X_1, X_2, \ldots$ be independent k-vectors having the same
distribution with zero means and finite covariance matrix $\Gamma$. Then

\[ \frac{X_1 + \cdots + X_n}{\sqrt{n}} \xrightarrow{\mathcal{D}} N(0, \Gamma). \]


Proof.

\[ E(\exp[i(u, X_1 + \cdots + X_n)/\sqrt{n}]) = [E \exp[i(u, X)/\sqrt{n}]]^n, \]

where X has the same distribution as $X_1, X_2, \ldots$ By assumption, $EX_j^2 < \infty$,
$j = 1, \ldots, k$, where $X = (X_1, \ldots, X_k)$. Thus $E\|X\|^2 < \infty$. By Proposition
11.7,

\[ f_X(u) = 1 - \tfrac12 E[(u, X)^2] + \delta(u) \|u\|^2, \]

so

\[ Ee^{i(u, X)/\sqrt{n}} = f_X(u/\sqrt{n}) = 1 - \frac{1}{2n} E[(u, X)^2] + \frac{1}{n}\, \delta(u/\sqrt{n}) \|u\|^2, \]

\[ [f_X(u/\sqrt{n})]^n = \Big[ 1 - \frac{1}{n} \Big( \tfrac12 E[(u, X)^2] - \delta(u/\sqrt{n}) \|u\|^2 \Big) \Big]^n
   \to e^{-\frac12 E(u, X)^2} = e^{-\frac12 \sum \Gamma_{ij} u_i u_j}. \]


Once more, as long as second moments exist, the limit is normal. There 
are analogs here of Theorem 9.2 for the nonidentically distributed case which 
involve bounds on $E\|X_n\|^3$. But we leave this; the tools are available for as
many of these theorems as we want to prove. 
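
The multidimensional central limit theorem can be illustrated by comparing an empirical characteristic function of the normalized sum with $\exp[-\tfrac12 u \Gamma u^t]$. The sketch below is only an illustration under stated assumptions: the 2-vector X = (U, U + V) built from independent random signs U, V is a hypothetical example (its covariance matrix is [[1, 1], [1, 2]]), and all function names and sample sizes are invented for the purpose.

```python
import cmath
import math
import random

def sample_vector(rng):
    """One zero-mean 2-vector X = (U, U + V), with U, V independent +/-1 signs.

    Its covariance matrix is GAMMA = [[1, 1], [1, 2]].
    """
    u = rng.choice((-1.0, 1.0))
    v = rng.choice((-1.0, 1.0))
    return (u, u + v)

def normalized_sum(n, rng):
    """Return (X_1 + ... + X_n) / sqrt(n) for independent copies of X."""
    s0 = s1 = 0.0
    for _ in range(n):
        x0, x1 = sample_vector(rng)
        s0 += x0
        s1 += x1
    root = math.sqrt(n)
    return (s0 / root, s1 / root)

def empirical_cf(u, n=50, samples=4000, seed=1):
    """Monte Carlo estimate of E exp(i (u, S_n / sqrt(n)))."""
    rng = random.Random(seed)
    total = 0j
    for _ in range(samples):
        y0, y1 = normalized_sum(n, rng)
        total += cmath.exp(1j * (u[0] * y0 + u[1] * y1))
    return total / samples

def normal_cf(u, gamma):
    """Characteristic function exp(-1/2 * u Gamma u^t) of N(0, Gamma)."""
    quad = sum(gamma[i][j] * u[i] * u[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad)

GAMMA = [[1.0, 1.0], [1.0, 2.0]]
```

For a fixed u, `empirical_cf(u)` should be close to `normal_cf(u, GAMMA)` up to Monte Carlo error, which is the statement of the theorem read through the continuity theorem.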


4. THE JOINT NORMAL DISTRIBUTION 
The neatest way of defining a joint normal distribution is 


Definition 11.11. Say that $Y = (Y_1, \ldots, Y_n)$, $EY_j = 0$, has a joint normal
(or joint Gaussian) distribution with zero means if there are k independent
random variables $X = (X_1, \ldots, X_k)$, each with N(0, 1) distribution, and a
k x n matrix A such that

\[ Y = XA. \]


Obviously, the matrix A and vector X are not unique. Say that a set
$Z_1, \ldots, Z_j$ of random variables is linearly independent if there are no real
numbers $\alpha_1, \ldots, \alpha_j$, not all zero, such that

\[ \alpha_1 Z_1 + \cdots + \alpha_j Z_j = 0 \quad \text{a.s.} \]


Then note that the minimal k, such that there is an A, X as in 11.11, with
Y = XA, is the maximum number of linearly independent random variables
in the set $Y_1, \ldots, Y_n$. If

\[ \sum_{j=1}^{n} \alpha_j Y_j = 0 \quad \text{a.s.}, \]

then

\[ \sum_{j=1}^{n} \alpha_j \Gamma_{ij} = 0, \quad i = 1, \ldots, n, \]
so the minimum k is also given by the rank of the covariance matrix $\Gamma$.
Throughout this section, take the joint normal distribution with zero means
to be defined by 11.11. We will show that it is equivalent to 11.9.

If Y = XA, then the covariance matrix of Y is given by

\[ EY_i Y_j = E\Big( \sum_k X_k a_{ki} \Big) \Big( \sum_m X_m a_{mj} \Big) = \sum_k a_{ki} a_{kj}. \]

So

\[ \Gamma = A^t A. \]

This is characteristic of covariance matrices.

Definition 11.12. Call a square matrix M symmetric if $m_{ij} = m_{ji}$, nonnegative
definite if $\alpha M \alpha^t \ge 0$ for all vectors $\alpha$, where $\alpha^t$ denotes the transpose of $\alpha$.

Proposition 11.13. An n x n matrix $\Gamma$ is the covariance matrix of some set of
random variables $Y_1, \ldots, Y_n$ if and only if
1) $\Gamma$ is symmetric nonnegative definite,
equivalently,
2) there is a matrix A such that

\[ \Gamma = A^t A. \]


Proof. One way is easy. Suppose $\Gamma_{ij} = EY_i Y_j$; then obviously $\Gamma$ is
symmetric, and $\sum \alpha_i \Gamma_{ij} \alpha_j = E(\sum \alpha_i Y_i)^2 \ge 0$. For the converse, we start with
the well-known result (see [63], for example) that for $\Gamma$ symmetric and
nonnegative definite, there is an orthogonal matrix O such that

\[ O \Gamma O^t = D, \]

where D is diagonal with nonnegative elements. Then taking B to be diagonal
with diagonal elements the square roots of the corresponding elements
of D gives $D = B^t B$, so that

\[ \Gamma = (O^t B^t)(BO). \]

Take A = BO; then $\Gamma = A^t A$. Now take X with components independent
N(0, 1) variables, Y = XA, to get the result that the covariance matrix
of Y is $\Gamma$.
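
The factorization of a covariance matrix as $A^t A$ and the construction Y = XA can be sketched in code. Note the swapped-in technique: the proof above uses an orthogonal diagonalization, whereas this sketch uses a Cholesky factorization, which is simpler to write by hand but assumes the matrix is strictly positive definite. The matrix `GAMMA` and all function names are hypothetical.

```python
import math
import random

def cholesky_upper(gamma):
    """Return an upper-triangular A with A^t A = gamma.

    Assumes gamma is symmetric and strictly positive definite.
    """
    n = len(gamma)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = gamma[i][j] - sum(a[k][i] * a[k][j] for k in range(i))
            if i == j:
                a[i][i] = math.sqrt(s)
            else:
                a[i][j] = s / a[i][i]
    return a

def gram(a):
    """Compute the matrix product A^t A."""
    n = len(a)
    return [[sum(a[k][i] * a[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def sample_y(a, rng):
    """Draw Y = X A, with X a row vector of independent N(0, 1) variables."""
    n = len(a)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(x[k] * a[k][j] for k in range(n)) for j in range(n)]

# Hypothetical symmetric positive definite covariance matrix.
GAMMA = [[4.0, 2.0, 0.0],
         [2.0, 2.0, 1.0],
         [0.0, 1.0, 3.0]]
A = cholesky_upper(GAMMA)
```

Calling `sample_y(A, random.Random(0))` then draws one vector whose covariance matrix is `GAMMA`; `gram(A)` recovers `GAMMA` up to rounding.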
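
If Y has joint normal distribution, Y = XA, then

\[ (u, Y) = \sum_j u_j Y_j = \sum_{j,k} u_j X_k a_{kj}. \]

Since the $X_k$ are independent N(0, 1),

\[ Ee^{i(u, Y)} = E \exp\Big[ i \sum_k X_k \Big( \sum_j u_j a_{kj} \Big) \Big]
   = \exp\Big[ -\tfrac12 \sum_k \Big( \sum_j u_j a_{kj} \Big)^2 \Big] = \exp[-\tfrac12 u A^t A u^t]. \]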
Hence the Y vector has a joint normal distribution in the sense of 11.9.
Furthermore, if Y has the characteristic function $e^{-\frac12 u \Gamma u^t}$, by finding A
such that $\Gamma = A^t A$, taking X to have independent N(0, 1) components
and $\tilde{Y} = XA$, then we get $\tilde{Y} \overset{\mathcal{D}}{=} Y$.

We can do better. Suppose Y has characteristic function $e^{-\frac12 u \Gamma u^t}$.
Take O to be an orthogonal matrix such that $O \Gamma O^t = D$ is diagonal.
Consider the vector $Z = Y O^t$. Then

\[ E(Z_i Z_j) = E\Big( \sum_{k,m} Y_k o_{ik} Y_m o_{jm} \Big) = \sum_{k,m} o_{ik} \Gamma_{km} o_{jm}, \]

so

\[ E(Z_i Z_j) = \begin{cases} 0, & i \ne j, \\ d_{ii}, & i = j. \end{cases} \]

Thus, the characteristic function of Z splits into a product, the $Z_1, Z_2, \ldots$
are independent by 8.33, and $Z_j$ is $N(0, d_{jj})$. Define $X_j = 0$ if $d_{jj} = 0$,
otherwise $X_j = Z_j / \sqrt{d_{jj}}$. Then the nonzero $X_j$ are independent N(0, 1)
variables, and there is a matrix A such that Y = XA.

A fresh memory of linear algebra will show that all that has been done
here is to get an orthonormal basis for the functions $Y_1, \ldots, Y_n$. This
could be done for any set of random variables $Y_1, \ldots, Y_n$, getting random
variables $X_1, \ldots, X_k$ such that $EX_i^2 = 1$, $EX_i X_j = 0$, $i \ne j$. But the
variables $Y_1, \ldots, Y_n$, having a joint normal distribution and zero means,
have their distribution completely determined by $\Gamma$ with the pleasant and
unusual property that $EY_i Y_j = 0$ implies $Y_i$ and $Y_j$ are independent.
Furthermore, if $I_1 = (i_1, \ldots, i_k)$, $I_2 = (j_1, \ldots, j_m)$ are disjoint subsets
of $(1, \ldots, n)$ and $\Gamma_{ij} = 0$, $i \in I_1$, $j \in I_2$, then it follows that $(Y_{i_1}, \ldots, Y_{i_k})$ and
$(Y_{j_1}, \ldots, Y_{j_m})$ are independent vectors.


We make an obvious extension to nonzero means by

Definition 11.14. Say that $Y = (Y_1, \ldots, Y_n)$, with $m = (EY_1, \ldots, EY_n)$,
has a joint normal distribution $N(m, \Gamma)$ if $(Y_1 - EY_1, \ldots, Y_n - EY_n)$ has
the joint normal distribution $N(0, \Gamma)$.


Also,


Definition 11.15. If $Y = (Y_1, \ldots, Y_n)$ has the distribution $N(0, \Gamma)$, the
distribution is said to be nondegenerate if the $Y_1, \ldots, Y_n$ are linearly
independent, or equivalently, if the rank of $\Gamma$ is n.


Problems

5. Show that if $Y_1, \ldots, Y_n$ are $N(0, \Gamma)$, the distribution is nondegenerate if
and only if there are n independent random variables $X_1, \ldots, X_n$, each
N(0, 1), and an n x n matrix A such that $\det(A) \ne 0$ and Y = XA.

6. Let $Y_1, \ldots, Y_n$ have a joint normal nondegenerate distribution. Show
that their distribution function has a density f(y) given by

\[ f(y) = \frac{1}{(2\pi)^{n/2} \sqrt{\det(\Gamma)}} \exp[-\tfrac12 y \Gamma^{-1} y^t]. \]

7. Show that if the random variables $Y_1, \ldots, Y_n$ have the characteristic
function $\exp[-\tfrac12 u H u^t]$ for some n x n matrix H, then H is the covariance
matrix of Y and $EY_j = 0$, $j = 1, \ldots, n$.


5. STATIONARY GAUSSIAN PROCESS 


Definition 11.16. A double ended process $\ldots, X_{-1}, X_0, X_1, \ldots$ is called
Gaussian if every finite subset of variables has a joint normal distribution.


Of course, this assumes that $E|X_n| < \infty$ for all n. When is a Gaussian
zero-mean process X stationary? Take $\Gamma(m, n) = EX_m X_n$. Since the
distribution of the process is determined by $\Gamma(m, n)$, the condition should be
on $\Gamma(m, n)$.


Proposition 11.17. X is stationary if and only if

\[ \Gamma(m, n) = \Gamma(m - n, 0), \quad \text{all } m, n. \]


Proof. If X is stationary then $EX_m X_n = EX_{m-n} X_{n-n} = \Gamma(m - n, 0)$.
Conversely, if true, then the characteristic function of $X_1, \ldots, X_n$ is

\[ \exp\Big( -\tfrac12 \sum_{k,j=1}^{n} u_k u_j \Gamma(k - j, 0) \Big). \]

But this is exactly the characteristic function of $X_{1+m}, \ldots, X_{n+m}$.

Use the notation (loose),

\[ \Gamma(n) = \Gamma(n, 0) \]

and call $\Gamma(n)$ the covariance function. Call a function M(n) on the integers
nonnegative definite if for I any finite subset of the integers, and $\alpha_j$, $j \in I$,
any real numbers,

\[ \sum_{k,j \in I} \alpha_k \alpha_j M(k - j) \ge 0. \]


Clearly a covariance function is nonnegative definite. Just as in the finite case,
given any symmetric nonnegative definite function H(n) on the integers, we
can construct a stationary zero-mean Gaussian process such that
$EX_m X_n = H(m - n)$.

How can we describe the general stationary Gaussian process? To do
this neatly, generalize a bit. Let $\ldots, X_{-1}, X_0, X_1, \ldots$ be a process of
complex-valued functions $X_j = U_j + iV_j$, where $U_j$, $V_j$ are random variables,
and $EU_j = EV_j = 0$, all j. Call it a complex Gaussian process if any finite
subset of the $\{U_n, V_n\}$ have a joint normal distribution, stationary if
$EX_m \bar{X}_n = \Gamma(m - n)$. The covariance function of a complex Gaussian
stationary process is Hermitian, $\Gamma(-n) = \overline{\Gamma(n)}$, and nonnegative definite in
the sense that for a subset I of the integers, and $\alpha_j$ complex numbers,

\[ \sum_{k,j \in I} \alpha_k \bar{\alpha}_j \Gamma(k - j) = E\Big| \sum_{k \in I} \alpha_k X_k \Big|^2 \ge 0. \]


Consider a process that is a superposition of periodic functions with
random amplitudes that are independent and normal. More precisely, let
$\lambda_1, \ldots, \lambda_k$ be real numbers (called frequencies), and define

\[ X_n = \sum_{j=1}^{k} Z_j e^{in\lambda_j}, \tag{11.18} \]

where $Z_1, \ldots, Z_k$ are independent $N(0, \sigma_j^2)$ variables. The $\{X_n\}$ process is a
complex Gaussian process. Further,

\[ EX_n \bar{X}_m = \sum_j \sum_l e^{in\lambda_j} e^{-im\lambda_l} EZ_j Z_l = \sum_j \sigma_j^2 e^{i(n - m)\lambda_j}, \]

so the process is stationary. The functions $e^{in\lambda_j}$ are the periodic components
with frequency $\lambda_j$, and we may as well take $\lambda_j \in [-\pi, +\pi)$. The formula
(11.18) can be thought of as representing the $\{X_n\}$ process by a sum over
frequency space $\lambda$, $\lambda \in [-\pi, +\pi)$. The main structural theorem for complex
normal stationary processes is that every such process can be represented as an
integral over frequency space, where the amplitudes of the various frequency
components are, in a generalized sense, independent and normally distributed.
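
The covariance function of the superposition (11.18), namely the sum over j of $\sigma_j^2 e^{in\lambda_j}$, can be evaluated directly to check that it is Hermitian and nonnegative definite. The following is a small sketch; the frequencies, variances, and function names are hypothetical values chosen for illustration.

```python
import cmath

def covariance(n, freqs, variances):
    """Gamma(n) = sum_j sigma_j^2 * exp(i * n * lambda_j) for the process (11.18)."""
    return sum(v * cmath.exp(1j * n * lam) for lam, v in zip(freqs, variances))

def quadratic_form(alphas, freqs, variances):
    """sum_{k,j} alpha_k * conj(alpha_j) * Gamma(k - j) over indices 0..len(alphas)-1."""
    total = 0j
    for k, ak in enumerate(alphas):
        for j, aj in enumerate(alphas):
            total += ak * aj.conjugate() * covariance(k - j, freqs, variances)
    return total

FREQS = [-2.0, 0.3, 1.7]      # hypothetical frequencies in [-pi, pi)
VARIANCES = [1.0, 0.5, 2.0]   # hypothetical variances sigma_j^2
```

For any complex coefficients the quadratic form should come out real and nonnegative up to rounding, since it equals the sum over j of $\sigma_j^2 |\sum_k \alpha_k e^{ik\lambda_j}|^2$.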


6. SPECTRAL REPRESENTATION 
OF STATIONARY GAUSSIAN PROCESSES 


The main tool in the representation theorem is a representation result for 
covariance functions. 


Herglotz Lemma 11.19 [70]. $\Gamma(n)$ is a Hermitian nonnegative definite function
on the integers if and only if there is a finite measure F(B) on $\mathcal{B}_1([-\pi, +\pi))$
such that

\[ \Gamma(n) = \int e^{in\lambda} F(d\lambda). \]


Proof. One direction is quick. If

\[ \Gamma(n) = \int e^{in\lambda} F(d\lambda), \]

then $\Gamma(n)$ is Hermitian, and

\[ \sum_{k,j \in I} \alpha_k \bar{\alpha}_j \Gamma(k - j) = \int \Big| \sum_{k \in I} \alpha_k e^{ik\lambda} \Big|^2 F(d\lambda) \ge 0. \]


To go the other way, following Loève [108, p. 207], define

\[ f_n(\lambda) = \frac{1}{2\pi} \sum_{k=-n+1}^{n-1} \Big( 1 - \frac{|k|}{n} \Big) \Gamma(k) e^{-ik\lambda}
   = \frac{1}{2\pi n} \sum_{j=1}^{n} \sum_{m=1}^{n} \Gamma(j - m) e^{-ij\lambda} e^{im\lambda} \ge 0. \]

Multiply both sides by $e^{il\lambda}$ for $-n + 1 \le l \le n - 1$, and integrate over
$[-\pi, +\pi]$ to get

\[ \Big( 1 - \frac{|l|}{n} \Big) \Gamma(l) = \int e^{il\lambda} f_n(\lambda)\, d\lambda, \qquad \Gamma(0) = \int f_n(\lambda)\, d\lambda. \]

Define $F_n(d\lambda)$ as the measure on $[-\pi, +\pi]$ with density $f_n(\lambda)$, and take n'
a subsequence such that $F_{n'} \xrightarrow{\mathcal{D}} F$. Then

\[ \Gamma(l) = \lim_{n'} \int e^{il\lambda} F_{n'}(d\lambda) = \int e^{il\lambda} F(d\lambda). \]

Now take the mass on the point $\{\pi\}$ and put it on $\{-\pi\}$ to complete the
proof.


Note that the functions $\{e^{in\lambda}\}$ are separating on $[-\pi, \pi)$, hence $F(d\lambda)$ is
uniquely determined by $\Gamma(n)$. For $\Gamma(n)$ the covariance function of a complex
Gaussian stationary process X, $F(d\lambda)$ is called the spectral distribution
function of X. To understand the representation of X, an integral with
respect to a random measure has to be defined.
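
The construction in the proof of the Herglotz lemma can be tried numerically: form the trigonometric polynomial $f_n(\lambda) = \frac{1}{2\pi} \sum_{|k| < n} (1 - |k|/n) \Gamma(k) e^{-ik\lambda}$, check that it is nonnegative, and check that integrating it against $e^{il\lambda}$ returns $(1 - |l|/n)\Gamma(l)$. The sketch below assumes the hypothetical covariance function $\Gamma(k) = \rho^{|k|}$, which is nonnegative definite for $|\rho| < 1$; the function names and grid size are illustrative.

```python
import cmath
import math

def gamma(k, rho=0.6):
    """Hypothetical covariance function Gamma(k) = rho^|k|."""
    return rho ** abs(k)

def f_n(lam, n):
    """f_n(lambda) = (1/2pi) * sum_{|k|<n} (1 - |k|/n) * Gamma(k) * exp(-i k lambda)."""
    total = sum((1 - abs(k) / n) * gamma(k) * cmath.exp(-1j * k * lam)
                for k in range(-n + 1, n))
    return total / (2 * math.pi)

def integrate(l, n, points=256):
    """Riemann sum for the integral of exp(i l lambda) * f_n(lambda) over [-pi, pi).

    On a uniform grid this sum is exact (up to rounding) for trigonometric
    polynomials of degree below `points`.
    """
    step = 2 * math.pi / points
    grid = [-math.pi + step * m for m in range(points)]
    return sum(cmath.exp(1j * l * lam) * f_n(lam, n) for lam in grid) * step
```

As n grows, the factors $(1 - |l|/n)$ tend to one, which is why a limit point F of the measures with densities $f_n$ has $\Gamma$ as its sequence of Fourier coefficients.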


Definition 11.20. Let $\{Z(\lambda)\}$ be a noncountable family of complex-valued
random variables on $(\Omega, \mathcal{F}, P)$ indexed by $\lambda \in [-\pi, +\pi)$. For I an interval
$[\lambda_1, \lambda_2)$, define $Z(I) = Z(\lambda_2) - Z(\lambda_1)$. If the Riemann sums $\sum f(\lambda_k) Z(I_k)$,
$I_1, \ldots, I_n$ a disjoint partition of $[-\pi, +\pi)$ into intervals left-closed, right-open,
$\lambda_k \in I_k$, converge in the second mean to the same random variable for any
sequence of partitions such that $\max_k |I_k| \to 0$, denote this limit random variable
by

\[ \int f(\lambda) Z(d\lambda). \]


Now we can state 
Theorem 11.21. Let X be a complex Gaussian stationary process on $(\Omega, \mathcal{F}, P)$
with spectral distribution function $F(d\lambda)$. Then there exists a family $\{Z(\lambda)\}$ of
complex-valued random variables on $(\Omega, \mathcal{F}, P)$ indexed by $\lambda \in [-\pi, +\pi)$ such
that

i) for any $\lambda_1, \ldots, \lambda_m$, $Z(\lambda_1), \ldots, Z(\lambda_m)$ have a joint normal distribution,
ii) for $I_1$, $I_2$ disjoint, $EZ(I_1)\overline{Z(I_2)} = 0$,
iii) $E|Z(I)|^2 = F(I)$, for all intervals I,
iv) $X_n = \int e^{in\lambda} Z(d\lambda)$, a.s. all n.


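Proof. The most elegant way to prove this is to use some elementary Hilbert
space arguments. Consider the space $\mathcal{L}(X)$ consisting of all finite linear
combinations

\[ \sum \alpha_k X_k, \]

where the $\alpha_k$ are complex numbers. Consider the class of all random variables
Y such that there exists a sequence $Y_n \in \mathcal{L}(X)$ with $Y_n \to Y$ in the second mean. On this class
define an inner product $(Y_1, Y_2)$ by $EY_1 \bar{Y}_2$. Call random variables equivalent
if they are a.s. equal. Then it is not difficult to check that the set of equivalence
classes of random variables forms a complete Hilbert space $L_2(X)$ under the
inner product $(Y_1, Y_2)$. Let $L_2(F)$ be the Hilbert space of all complex-valued
$\mathcal{B}_1([-\pi, +\pi))$ measurable functions $f(\lambda)$ such that $\int |f(\lambda)|^2 F(d\lambda) < \infty$
under the inner product $(f, g) = \int f \bar{g}\, dF$ (take equivalence classes again).

To the element $X_n \in L_2(X)$, correspond the function $e^{in\lambda} \in L_2(F)$.
Extend this correspondence linearly,

\[ \sum \alpha_k X_k \leftrightarrow \sum \alpha_k e^{ik\lambda}. \]

Let $\mathcal{L}(F)$ be the class of all finite linear combinations $\sum \alpha_k e^{ik\lambda}$. Then for
$Y_1, Y_2 \in \mathcal{L}(X)$, $f, g \in \mathcal{L}(F)$,
$Y_1 \leftrightarrow f$, $Y_2 \leftrightarrow g$ implies $\alpha Y_1 + \beta Y_2 \leftrightarrow \alpha f + \beta g$, and

\[ (Y_1, Y_2) = E\Big( \sum_k \alpha_k X_k \Big) \overline{\Big( \sum_j \beta_j X_j \Big)} = \sum_{k,j} \alpha_k \bar{\beta}_j \Gamma(k - j)
   = \int \Big( \sum_k \alpha_k e^{ik\lambda} \Big) \overline{\Big( \sum_j \beta_j e^{ij\lambda} \Big)} F(d\lambda) = \int f \bar{g}\, dF = (f, g). \]

If $Y_n \in \mathcal{L}(X)$ and $Y_n \to Y$ in the second mean, then $Y_n$ is Cauchy-convergent in the second
mean; consequently so is the sequence $f_n \leftrightarrow Y_n$. Hence there is an $f \in L_2(F)$
such that $f_n \to f$ in the second mean. Define $Y \leftrightarrow f$; this can be checked to give a one-to-one
correspondence between $L_2(F)$ and $L_2(X)$, which is linear and preserves
inner products.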
The function $\chi_{[-\pi, \xi)}(\lambda)$ is in $L_2(F)$; let $Z(\xi)$ be the corresponding element
in $L_2(X)$. Now to check that the family $\{Z(\xi)\}$ has the properties asserted.

Begin with (i). If real random variables $Y_k^{(n)} \to Y_k$ in the second mean, $k = 1, \ldots, m$, and
$(Y_1^{(n)}, \ldots, Y_m^{(n)})$ has a joint normal distribution for each n, then $(Y_1, \ldots, Y_m)$
is joint normal. This is because each element $\Gamma_{kj}^{(n)} = EY_k^{(n)} Y_j^{(n)}$ of the covariance
matrix converges to $\Gamma_{kj} = EY_k Y_j$; hence the characteristic function of
$Y^{(n)}$ converges to $e^{-\frac12 u \Gamma u^t}$, and this must be the characteristic function of Y.
Conclude that if $Y_1, \ldots, Y_m$ are in $L_2(X)$, then their real and imaginary
components have joint normal distributions. Thus for $\xi_1, \ldots, \xi_m$ in
$[-\pi, \pi)$, the real and imaginary components of $Z(\xi_1), \ldots, Z(\xi_m)$ have a
joint normal distribution. For any interval $I = [\xi_1, \xi_2)$, $Z(I) \leftrightarrow \chi_I(\lambda)$.
Hence for $I_1$, $I_2$ disjoint,

\[ EZ(I_1)\overline{Z(I_2)} = \int \chi_{I_1}(\lambda) \chi_{I_2}(\lambda) F(d\lambda) = 0. \]

Also,

\[ E|Z(I)|^2 = \int |\chi_I(\lambda)|^2 F(d\lambda) = \int \chi_I(\lambda) F(d\lambda) = F(I). \]

Lastly, take $f(\lambda)$ to be a uniformly continuous function on $[-\pi, \pi)$. For a
partition of $[-\pi, \pi)$ into disjoint intervals $I_1, \ldots, I_n$, left-closed, right-open,
and $\lambda_k \in I_k$,

\[ \sum_k f(\lambda_k) Z(I_k) \leftrightarrow \sum_k f(\lambda_k) \chi_{I_k}(\lambda). \]

The function on the right equals $f(\lambda_k)$ on the interval $I_k$, and converges
uniformly to $f(\lambda)$ as $\max_k |I_k| \to 0$. So, in particular, it converges in the
second mean to $f(\lambda)$. If $Y \leftrightarrow f(\lambda)$, then $\sum f(\lambda_k) Z(I_k) \to Y$ in the second mean. From the
definition 11.20,

\[ Y = \int f(\lambda) Z(d\lambda). \]

For $f(\lambda) = e^{in\lambda}$, the corresponding element is $X_n$, thus

\[ X_n = \int e^{in\lambda} Z(d\lambda) \quad \text{a.s.} \]

Q.E.D.


Proposition 11.22. If $\{X_n\}$ is a real stationary Gaussian process, then the
family $\{Z(\lambda)\}$, $\lambda \in [-\pi, +\pi)$, has the additional properties: If $\{Z_1(\lambda)\}$, $\{Z_2(\lambda)\}$
are the real and imaginary parts of the $\{Z(\lambda)\}$ process,

\[ Z(\lambda) = Z_1(\lambda) + iZ_2(\lambda), \]

then
i) For any two intervals I, J,

\[ EZ_1(I) Z_2(J) = 0. \]

ii) For any two disjoint intervals I, J,

\[ EZ_1(I) Z_1(J) = 0, \quad EZ_2(I) Z_2(J) = 0. \]
Proof. Write any intervals I, J as the union of the common part $I \cap J$ and
nonoverlapping parts, and apply 11.21(ii) and (iii) to conclude that the
imaginary part of $EZ(I)\overline{Z(J)}$ is zero. Therefore,

\[ EZ_1(I) Z_2(J) - EZ_2(I) Z_1(J) = 0. \tag{11.23} \]

Inspect the correspondence set up in the proof of 11.21 and notice that if
$Y = \sum \alpha_k X_k$ and $Y \leftrightarrow f(\lambda)$, then $\bar{Y} = \sum \bar{\alpha}_k X_k$ corresponds to $\overline{f(-\lambda)}$. This
extends to all $L_2(X)$ and $L_2(F)$. Hence, since $\chi_I(\lambda) \leftrightarrow Z(I)$, then
$\chi_{-I}(\lambda) \leftrightarrow \overline{Z(I)}$, so

\[ Z(-I) = \overline{Z(I)} \quad \text{a.s.}, \]

where $-I = \{\lambda;\ -\lambda \in I\}$. From this,

\[ Z_1(-I) = Z_1(I), \quad Z_2(-I) = -Z_2(I). \]

Thus, if we change I to $-I$ in (11.23), the first term remains the same, and the
second term changes sign; hence both terms are zero. For I, J disjoint,
$EZ(I)\overline{Z(J)} = 0$ implies $EZ_1(I) Z_1(J) = -EZ_2(I) Z_2(J)$. We can use the sign
change again as above to prove that both sides are individually zero. Q.E.D.


For a real stationary Gaussian process with zero means, we can deduce
from 11.22(i) that the processes $\{Z_1(\lambda)\}$, $\{Z_2(\lambda)\}$ are independent in the sense
that all finite subsets

\[ \{Z_1(I_j)\},\ j = 1, \ldots, m, \qquad \{Z_2(J_k)\},\ k = 1, \ldots, m, \]

are independent. From 11.22(ii), we deduce that for $I_1, \ldots, I_n$ disjoint
intervals, the random variables $Z_1(I_1), \ldots, Z_1(I_n)$ are independent. Similarly
for $Z_2(I_1), \ldots, Z_2(I_n)$.


7. OTHER PROBLEMS 


The fact that $F(d\lambda)$ completely determines the distribution of a stationary
Gaussian process with zero means leads to some compact results. For
example, the process X is ergodic if and only if $F(d\lambda)$ assigns no mass to any
one-point sets [111].

The correspondence between $L_2(X)$ and $L_2(F)$ was exploited by Kolmogorov
[96, 97] and independently by Wiener [143] in a fascinating piece
of analysis that leads to the solution of the prediction problem. The starting
point is this: the best predictor in a mean-square sense of $X_1$ based on
$X_0, X_{-1}, \ldots$ is $E(X_1 \mid X_0, X_{-1}, \ldots)$. But for a Gaussian process, there are
constants $\alpha_k^{(n)}$ such that

\[ E(X_1 \mid X_0, \ldots, X_{-n}) = \sum_{k=0}^{n} \alpha_k^{(n)} X_{-k}. \]

Because by taking the $\alpha_k^{(n)}$ such that

\[ E\Big[ \Big( X_1 - \sum_{k=0}^{n} \alpha_k^{(n)} X_{-k} \Big) X_{-j} \Big] = 0, \quad j = 0, \ldots, n, \]

or

\[ \Gamma(1 + j) = \sum_{k=0}^{n} \alpha_k^{(n)} \Gamma(j - k), \quad j = 0, \ldots, n, \]

then $X_1 - \sum_0^n \alpha_k^{(n)} X_{-k}$ is independent of $(X_0, X_{-1}, \ldots, X_{-n})$, so that

\[ E\Big( X_1 - \sum_{k=0}^{n} \alpha_k^{(n)} X_{-k} \,\Big|\, X_0, X_{-1}, \ldots, X_{-n} \Big) = 0. \]

From the Martingale theorems, since $EX_1^2 < \infty$, it is easy to deduce that

\[ E(X_1 \mid X_0, X_{-1}, \ldots, X_{-n}) \to E(X_1 \mid X_0, X_{-1}, \ldots) \quad \text{in the second mean}. \]

Hence the best predictor is in the space $L_2(X_0, X_{-1}, \ldots)$ generated by all linear
combinations of $X_0, X_{-1}, \ldots$. By the isomorphism this translates into the
problem: Let $L_2^-(F)$ be the space generated by all linear combinations of
$e^{ik\lambda}$, $k = 0, -1, -2, \ldots$ Find the element $f(\lambda) \in L_2^-(F)$ which minimizes

\[ \int |e^{i\lambda} - f(\lambda)|^2 F(d\lambda). \]


In a similar way, many problems concerning Gaussian processes translate
over into interesting and sometimes well-known problems in functions of a
real variable, in particular, usually in the area of approximation theory.
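
The finite-past version of the prediction problem reduces to a linear system: the coefficients of the best linear predictor of $X_1$ from $X_0, \ldots, X_{-n}$ solve $\Gamma(1 + j) = \sum_k \alpha_k \Gamma(j - k)$, $j = 0, \ldots, n$, for a real process. The sketch below solves that system for the hypothetical covariance function $\Gamma(k) = \rho^{|k|}$, for which only $X_0$ should receive weight; the solver is plain Gaussian elimination and all names are illustrative.

```python
def gamma(k, rho=0.6):
    """Hypothetical covariance function Gamma(k) = rho^|k| of a real stationary process."""
    return rho ** abs(k)

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def predictor_coefficients(n):
    """Solve Gamma(1 + j) = sum_k alpha_k Gamma(j - k), j = 0..n, for alpha_0..alpha_n."""
    matrix = [[gamma(j - k) for k in range(n + 1)] for j in range(n + 1)]
    rhs = [gamma(1 + j) for j in range(n + 1)]
    return solve(matrix, rhs)

def mean_square_error(alphas):
    """E (X_1 - sum_k alpha_k X_{-k})^2, computed from the covariance function."""
    n = len(alphas)
    cross = sum(a * gamma(1 + k) for k, a in enumerate(alphas))
    quad = sum(alphas[j] * alphas[k] * gamma(j - k) for j in range(n) for k in range(n))
    return gamma(0) - 2 * cross + quad
```

With this covariance function, `predictor_coefficients(n)` puts weight $\rho$ on $X_0$ and zero on the earlier variables, and the mean square error is $1 - \rho^2$ for every n.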


NOTES 


Note that the representation theorem 11.21 for Gaussian processes depends
only on the fact that $EX_n \bar{X}_m$ depends on the difference n - m. Define
$\{X_n\}$, $n = 0, \pm 1, \ldots$ to be a complex process stationary in the second order
if $\Gamma(n, m) = EX_n \bar{X}_m$ is a function of n - m. The only difference in the
conclusion of 11.21 is that (i) is deleted. This representation theorem was
proved by Cramér [21, 1942], and independently by Loève [107]. Since
the work on the prediction problem in 1941-42 by Kolmogorov [96] and
[97], and independently by Wiener [143], there has been a torrent of publications
on second-order stationary processes and a sizeable amount on
Gaussian processes. For a complete and rigorous treatment of these matters,
refer to Doob's book [39]. For a treatment which is simpler and places more
stress on applications, see Yaglom [144].


CHAPTER 12 


STOCHASTIC PROCESSES 
AND BROWNIAN MOTION 


1. INTRODUCTION 


The natural generalization of a sequence of random variables $\{X_n\}$ is a
collection of random variables $\{X_t\}$ indexed by a parameter t in some interval
I. Such an object we will call a stochastic process.


Definition 12.1. A stochastic process or continuous parameter process is a
collection $\{X_t(\omega)\}$ of random variables on $(\Omega, \mathcal{F}, P)$ where t ranges over an
interval $I \subset R^{(1)}$. Whenever convenient the notation $\{X(t, \omega)\}$ or simply
$\{X(t)\}$ will be used.


For fixed $\omega$, what is produced by observing the values of $X(t, \omega)$ is a function
x(t) on I.

The most famous stochastic process and the most central in probability
theory is Brownian motion. This comes up like so: let X(t) denote one
position coordinate of a microscopic particle undergoing molecular
bombardments in a glass of water. Make the three assumptions given below.


Assumptions 12.2
1) Independence: $X(t + \Delta t) - X(t)$ is independent of $\{X(\tau)\}$, $\tau \le t$.
2) Stationarity: The distribution of $X(t + \Delta t) - X(t)$ does not depend on t.
3) Continuity:

\[ \lim_{\Delta t \downarrow 0} \frac{P(|X(t + \Delta t) - X(t)| > \delta)}{\Delta t} = 0, \quad \text{for all } \delta > 0. \]


This is the sense of the assumptions: (1) means that the change in
position during time $[t, t + \Delta t]$ is independent of anything that has happened
up to the time t. This is obviously only a rough approximation. Physically,
what is much more correct is that the momentum imparted to the particle
due to molecular bombardments during $[t, t + \Delta t]$ is independent of what
has happened up to time t. This assumption makes sense only if the
displacement of the particle due to its initial velocity at the beginning of the
interval $[t, t + \Delta t]$ is small compared to the displacements it suffers as a
result of molecular momentum exchange over $[t, t + \Delta t]$. From a model
point of view this is the worst assumption of the three. Accept it for now;
later we derive the so-called exact model for the motion in which (1) will be
replaced. The second assumption is quite reasonable: It simply requires
homogeneity in time; that the distribution of change over any time interval
depend only on the length of the time interval, and not on the location of the
origin of time. This corresponds to a model in which the medium is considered
to be infinite in extent.

The third assumption is interesting. We want all the sample functions
of our motion to be continuous. A model in which the particle took
instantaneous jumps would be a bit shocking. Split the interval (0, 1] into n
parts, $\Delta t = 1/n$. If the motion is continuous, then

\[ h(\Delta t) = \sup_{1 \le k \le n} |X(k \Delta t) - X((k - 1) \Delta t)| \]

must converge to zero as $\Delta t \to 0$. At a minimum, for any $\delta > 0$,

\[ \lim_{\Delta t \downarrow 0} P(h(\Delta t) > \delta) = 0. \]

By (1) the variables $Y_k = |X(k \Delta t) - X((k - 1) \Delta t)|$ are independent;
by (2), they all have the same distribution. Thus

\[ P(h(\Delta t) > \delta) = 1 - P\Big( \sup_{1 \le k \le n} Y_k \le \delta \Big) = 1 - [P(Y_1 \le \delta)]^n
   = 1 - [1 - P(Y_1 > \delta)]^n \simeq 1 - e^{-nP(Y_1 > \delta)}, \tag{12.3} \]

so that $P(h(\Delta t) > \delta) \to 0$ if and only if $nP(Y_1 > \delta) \to 0$. This last is exactly

\[ \lim_{\Delta t \downarrow 0} \frac{P(|X(t + \Delta t) - X(t)| > \delta)}{\Delta t} = 0. \]


Make the further assumption that X(0) = 0. This is not a restriction, but
can be done by considering the process X(t) - X(0), $t \ge 0$, which again
satisfies (1), (2), and (3). Then


Proposition 12.4. For any process $\{X(t)\}$, $t \ge 0$, satisfying 12.2 (1), (2), and (3)
with X(0) = 0, X(t) has a normal distribution with $EX(t) = \mu t$,
$\sigma^2(X(t)) = \sigma^2 t$.


Proof. For any t, let $\Delta t = t/n$, $Y_k = X(k \Delta t) - X((k - 1) \Delta t)$. Then
$X(t) = Y_1 + \cdots + Y_n$, where the $Y_1, \ldots, Y_n$ are independent and identically
distributed. Therefore X(t) has an infinitely divisible law. Utilize the
proof in (12.3) to show that $M_n = \max_{1 \le k \le n} |Y_k|$ converges in probability to
zero. By 9.6, X(t) has a normal distribution. Let $\varphi_1(t) = EX(t)$. Then

\[ \varphi_1(t + \tau) = EX(t + \tau) = E(X(t + \tau) - X(\tau)) + EX(\tau) = \varphi_1(t) + \varphi_1(\tau). \]

Let $\varphi_2(t) = \sigma^2(X(t))$, so that

\[ \varphi_2(t + \tau) = E(X(t + \tau) - \varphi_1(t + \tau))^2
   = E[(X(t + \tau) - X(\tau) - \varphi_1(t)) + (X(\tau) - \varphi_1(\tau))]^2
   = \varphi_2(t) + \varphi_2(\tau). \]

The fact is now that $\varphi_1(t)$ and $\varphi_2(t)$ are continuous. This follows from
$X(t + \tau) \xrightarrow{\mathcal{D}} X(t)$ as $\tau \to 0$, which for normal variables implies

\[ EX(t + \tau) \to EX(t), \quad \sigma^2(X(t + \tau)) \to \sigma^2(X(t)). \]

It is easy to show that any continuous solutions of the equation
$\varphi(t + \tau) = \varphi(t) + \varphi(\tau)$ are linear.
Use the above heuristics to back into a definition of Brownian motion. 


Definition 12.5. Brownian motion is a stochastic process on $[0, \infty)$ such that
X(0) = 0 and the joint distribution of

\[ X(t_n), \ldots, X(t_0), \quad t_n > t_{n-1} > \cdots > t_0 \ge 0, \]

is specified by the requirement that $X(t_k) - X(t_{k-1})$, $k = 1, \ldots, n$, be
independent, normally distributed random variables with

\[ E(X(t_k) - X(t_{k-1})) = (t_k - t_{k-1}) \mu, \]
\[ \sigma^2(X(t_k) - X(t_{k-1})) = (t_k - t_{k-1}) \sigma^2. \]

This can be said another way. The random variables $X(t_n), \ldots, X(t_0)$ have a
joint normal distribution with $EX(t_k) = \mu t_k$ and

\[ \Gamma(t_j, t_k) = E(X(t_j) - \mu t_j)(X(t_k) - \mu t_k)
   = E[(X(t_j) - X(t_k) - \mu(t_j - t_k)) + (X(t_k) - \mu t_k)](X(t_k) - \mu t_k)
   = \sigma^2 t_k, \quad \text{for } t_j \ge t_k, \]

so that $\Gamma(t_j, t_k) = \sigma^2 \min(t_j, t_k)$.


Problem 1. Show that Brownian motion, as defined by 12.5, satisfies 12.2
(1), (2), and (3).
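
Definition 12.5 translates directly into a sampling recipe: add independent normal increments with mean proportional to the elapsed time and variance proportional to the elapsed time. The sketch below follows that recipe and estimates the covariance of the centered process at two times, which should be close to $\sigma^2 \min(s, t)$. The parameter values, sample size, and function names are assumptions made for the example.

```python
import math
import random

def brownian_values(times, mu, sigma, rng):
    """Sample X(t) at increasing times by summing independent normal increments.

    Each increment X(t_k) - X(t_{k-1}) is normal with mean (t_k - t_{k-1}) * mu
    and variance (t_k - t_{k-1}) * sigma^2, starting from X(0) = 0.
    """
    x, prev, out = 0.0, 0.0, []
    for t in times:
        dt = t - prev
        x += rng.gauss(mu * dt, sigma * math.sqrt(dt))
        out.append(x)
        prev = t
    return out

def empirical_covariance(s, t, mu=0.5, sigma=1.0, samples=4000, seed=2):
    """Estimate E (X(s) - mu s)(X(t) - mu t) for 0 < s < t by Monte Carlo."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        xs, xt = brownian_values([s, t], mu, sigma, rng)
        total += (xs - mu * s) * (xt - mu * t)
    return total / samples
```

With $\sigma = 1$, `empirical_covariance(0.5, 1.0)` should land near 0.5 up to Monte Carlo error, whatever the drift $\mu$.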
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2. BROWNIAN MOTION AS THE LIMIT OF RANDOM WALKS 


There are other ways of looking at Brownian motion. Consider a particle that moves to the right or left a distance $\Delta x$ with probability $\tfrac12$. It does this each $\Delta t$ time unit. Let $Y_1, \ldots$ be independent, and equal $\pm\Delta x$ with probability $\tfrac12$ each. The particle at time $t$ has made $[t/\Delta t]$ jumps ($[z]$ indicates greatest integer $\leq z$). Thus the position of the particle is given by

$$D(t) = Y_1 + \cdots + Y_{[t/\Delta t]}.$$

The idea is that if $\Delta x, \Delta t \to 0$ in the right way, then $D(t)$ will approach Brownian motion in some way. To figure out how to let $\Delta x, \Delta t \to 0$, note that $ED^2(t) \simeq (\Delta x)^2 t/\Delta t$. To keep this finite and nonzero, $\Delta x$ has to be of the order of magnitude of $\sqrt{\Delta t}$. For simplicity, take $\Delta x = \sqrt{\Delta t}$. Take $\Delta t = 1/n$, then the $Y_1, \ldots$ equal $\pm 1/\sqrt{n}$. Thus the $D(t)$ process has the same distribution as

$$X^{(n)}(t) = \frac{Z_1 + \cdots + Z_{[nt]}}{\sqrt{n}},$$

where $Z_1, \ldots$ are $\pm 1$ with probability $\tfrac12$.
Note that

$$X^{(n)}(t) = \sqrt{\frac{[nt]}{n}} \cdot \frac{Z_1 + \cdots + Z_{[nt]}}{\sqrt{[nt]}},$$

and apply the central limit theorem to conclude $X^{(n)}(t) \xrightarrow{\mathcal{D}} N(0, t)$. In addition, it is no more difficult to show that all the joint distributions of $X^{(n)}(t)$ converge to those of Brownian motion. Therefore, Brownian motion appears as the limit of processes consisting of consecutive sums of independent, identically distributed random variables, and its study is an extension of the study of the properties of such sequences.
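The scaling $\Delta x = \sqrt{\Delta t}$ can be illustrated by simulation. The sketch below is not from the original text: it draws the scaled walk $X^{(n)}(t)$ repeatedly and estimates its variance, which should be close to $[nt]/n \approx t$. The helper names, seed, and sample sizes are assumptions chosen for illustration.

```python
import math
import random

def scaled_walk(t, n, rng):
    """X^(n)(t) = (Z_1 + ... + Z_[nt]) / sqrt(n), with Z_i = +1 or -1 equally likely."""
    steps = math.floor(n * t)
    total = sum(rng.choice((-1, 1)) for _ in range(steps))
    return total / math.sqrt(n)

def empirical_variance(t, n, n_samples=4000, seed=1):
    """Sample variance of X^(n)(t) over independent repetitions."""
    rng = random.Random(seed)
    values = [scaled_walk(t, n, rng) for _ in range(n_samples)]
    mean = sum(values) / n_samples
    return sum((v - mean) ** 2 for v in values) / n_samples

if __name__ == "__main__":
    # Theory: variance is [nt]/n = 2.0 for t = 2, n = 100.
    print(empirical_variance(2.0, 100))
```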

What has been done in 12.5 is to specify all the finite-dimensional distribution functions of the process. There is now the question again: Is there a process $\{X(t)\}$, $t \in [0, \infty)$, on $(\Omega, \mathcal{F}, P)$ with these finite-dimensional distributions? This diverts us into some foundational work.


3. DEFINITIONS AND EXISTENCE 


Consider a stochastic process $\{X(t)\}$, $t \in I$, on $(\Omega, \mathcal{F}, P)$. For fixed $\omega$, $X(t, \omega)$ is a real-valued function on $I$. Hence denote by $R^I$ the class of all real-valued functions $x(t)$ on $I$, and by $\mathbf{X}(\omega)$ the vector variable $\{X(t, \omega)\}$ taking values in $R^I$.


Definition 12.6. $\mathcal{F}(X(s), s \in J)$, $J \subset I$, is the smallest $\sigma$-field $\mathcal{F}$ such that all $X(s)$, $s \in J$, are $\mathcal{F}$-measurable.


Definition 12.7. A finite-dimensional rectangle in $R^I$ is any set of the form

$$S = \{x(\cdot) \in R^I;\ x(t_1) \in I_1, \ldots, x(t_n) \in I_n\},$$

where $I_1, \ldots, I_n$ are intervals. Let $\mathcal{B}_I$ be the smallest $\sigma$-field of subsets of $R^I$ containing all finite-dimensional rectangles.


For the understanding of what is going on here it is important to characterize $\mathcal{B}_I$. Say that a set $B \in \mathcal{B}_I$ has a countable base $T = \{t_j\}$ if it is of the form

$$B = \{x(\cdot);\ (x(t_1), x(t_2), \ldots) \in D\}, \qquad D \in \mathcal{B}_\infty.$$

This means that $B$ is a set depending only on the coordinates $x(t_1), x(t_2), \ldots$


Proposition 12.8. The class $\mathcal{C}$ of all sets with a countable base forms a $\sigma$-field, hence $\mathcal{C} = \mathcal{B}_I$.


Proof. Let $B_1, B_2, \ldots \in \mathcal{C}$, and take $T_k$ as the base for $B_k$; then $T = \bigcup_k T_k$ is a base for all $B_1, B_2, \ldots$, and if $T = \{t_j\}$, each $B_k$ may be written as

$$B_k = \{x(\cdot);\ (x(t_1), \ldots) \in D_k\}, \qquad D_k \in \mathcal{B}_\infty.$$

Now it is pretty clear that any countable set combinations of the $B_k$ produce a set with base $T$, hence a set in $\mathcal{C}$.


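Corollary 12.9. For $B \in \mathcal{B}_I$, $\{\omega;\ \mathbf{X}(\omega) \in B\} \in \mathcal{F}$.

Proof. By the previous proposition there is a countable set $\{t_j\}$ such that

$$B = \{x(\cdot);\ (x(t_1), \ldots) \in D\}, \qquad D \in \mathcal{B}_\infty.$$

Thus $\{\omega;\ \mathbf{X}(\omega) \in B\} = \{\omega;\ (X(t_1), \ldots) \in D\}$, and this is in $\mathcal{F}$ by 2.13.

Definition 12.10. The finite-dimensional distribution functions of the process are given by

$$F_{t_1, \ldots, t_n}(x_1, \ldots, x_n) = P(X(t_1) < x_1, \ldots, X(t_n) < x_n), \qquad t_1 < t_2 < \cdots < t_n.$$

The notation $F_{\mathbf{t}}(\mathbf{x})$ may also be used.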


Definition 12.11. The distribution of the process is the probability $\hat{P}$ on $\mathcal{B}_I$ defined by

$$\hat{P}(B) = P(\mathbf{X} \in B).$$

It is easy to prove that


Proposition 12.12. Any two stochastic processes on $I$ having the same finite-dimensional distribution functions have the same distribution.


Proof. If $X(t)$, $X'(t)$ are the two processes, it follows from 2.22 that $X(t_1), X(t_2), \ldots$ and $X'(t_1), X'(t_2), \ldots$ have the same distribution. Thus if $B$ is any set with base $\{t_1, t_2, \ldots\}$, there is a set $D \in \mathcal{B}_\infty$ such that

$$P(\mathbf{X} \in B) = P((X(t_1), \ldots) \in D) = P'((X'(t_1), \ldots) \in D) = P'(\mathbf{X}' \in B).$$

The converse is also true, namely, that starting with a consistent set of finite-dimensional distribution functions we may construct a process having those distribution functions.


Definition 12.13. Given a set of distribution functions

$$\{F_{t_1, \ldots, t_n}(x_1, \ldots, x_n)\}$$

defined for all finite subsets $\{t_1 < \cdots < t_n\}$ of $I$. They are said to be consistent if

$$\lim_{x_k \uparrow \infty} F_{t_1, \ldots, t_n}(x_1, \ldots, x_n) = F_{t_1, \ldots, \hat{t}_k, \ldots, t_n}(x_1, \ldots, \hat{x}_k, \ldots, x_n),$$

where the $\hat{\ }$ denotes missing.


Theorem 12.14. Given a set of consistent distribution functions as in 12.13 above, there is a stochastic process $\{X(t)\}$, $t \in I$, such that

$$F_{t_1, \ldots, t_n}(x_1, \ldots, x_n) = P(X(t_1) < x_1, \ldots, X(t_n) < x_n).$$


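Proof. Take $(\Omega, \mathcal{F})$ to be $(R^I, \mathcal{B}_I)$. Denote by $T, T_1, T_2$, etc., countable subsets of $I$, and by $\mathcal{B}_T$ all sets of the form

$$\{x(\cdot);\ (x(t_1), \ldots) \in D\}, \qquad D \in \mathcal{B}_\infty,\ \{t_j\} = T.$$

By the extension theorem 2.26, there is a probability $P_T$ on $\mathcal{B}_T$ such that

$$P_T(x(\cdot);\ x(t_1) < x_1, \ldots, x(t_n) < x_n) = F_{t_1, \ldots, t_n}(x_1, \ldots, x_n).$$

Take $B$ any set in $\mathcal{B}_I$; then by 12.8 there is a $T$ such that $B \in \mathcal{B}_T$. We would like to define $P$ on $\mathcal{B}_I$ by $P(B) = P_T(B)$. To do this, the question is: is $P$ well defined? That is, if we let $B \in \mathcal{B}_{T_1}$, $B \in \mathcal{B}_{T_2}$, is $P_{T_1}(B) = P_{T_2}(B)$? Now $B \in \mathcal{B}_{T_1 \cup T_2}$, hence it is sufficient to show that

$$T \subset T',\ B \in \mathcal{B}_T \ \Rightarrow\ P_T(B) = P_{T'}(B).$$

But $\mathcal{B}_T \subset \mathcal{B}_{T'}$, and $P_T = P_{T'}$ on all rectangles with base in $T$; hence $P_{T'}$ is an extension to $\mathcal{B}_{T'}$ of $P_T$ on $\mathcal{B}_T$, so $P_T(B) = P_{T'}(B)$. Finally, to show $P$ is $\sigma$-additive on $\mathcal{B}_I$, take $\{B_n\}$ disjoint in $\mathcal{B}_I$; then there is a $T$ such that all $B_1, B_2, \ldots$ are in $\mathcal{B}_T$, hence so is $\bigcup B_n$. Obviously now, by the $\sigma$-additivity of $P_T$,

$$P\Big(\bigcup B_n\Big) = P_T\Big(\bigcup B_n\Big) = \sum P_T(B_n) = \sum P(B_n).$$

The probability space is now defined. Finish by taking

$$X(t, x(\cdot)) = x(t).$$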
4. BEYOND THE KOLMOGOROV EXTENSION


One point of the previous paragraph was that from the definition, the most complicated sets that could be guaranteed to be in $\mathcal{F}$ were of the form

$$\{\omega;\ (X(t_1), X(t_2), \ldots) \in D\}, \qquad D \in \mathcal{B}_\infty.$$

Furthermore, starting from the distribution functions, the extension to $\mathcal{B}_I$ is unique, and the maximal $\sigma$-field to which the extension is unique is the completion, $\bar{\mathcal{B}}_I$, the class of all sets $A$ such that $A$ differs from a set in $\mathcal{B}_I$ by a subset of a set of probability zero (see Appendix A.10).
Now consider sets of the form

$$A_1 = \{X(t, \omega) = 0 \text{ at least once for } t \in I\},$$
$$A_2 = \{|X(t)| < \alpha, \text{ all } t \in I\}.$$

These can be expressed as

$$A_1 = \bigcup_{t \in I} \{X(t, \omega) = 0\},$$
$$A_2 = \bigcap_{t \in I} \{|X(t, \omega)| < \alpha\}.$$

If each $X(t)$ has a continuous distribution, then $A_1$ is a noncountable union of sets of probability zero. $A_2$ is a noncountable intersection. Neither of $A_1$, $A_2$ depends on a countable number of coordinates because

$$A_1^c = \{X(t, \omega) \neq 0 \text{ for all } t \in I\}.$$

Clearly, $A_1^c$ does not contain any set of the form $\{(X(t_1), \ldots) \in B\}$, $B \in \mathcal{B}_\infty$. Thus $A_1^c$ is not of the form $\{\mathbf{X} \in B\}$, $B \in \mathcal{B}_I$, so neither is $A_1$. Similarly, $A_2$ contains no sets of the form $\{\mathbf{X} \in B\}$, $B \in \mathcal{B}_I$. This forces the unpleasant conclusion that if all we are given are the joint distribution functions of the process, there is no unique way of calculating the interesting and important probabilities that a process has a zero crossing during the time interval $I$ or remains bounded below $\alpha$ in absolute value during $I$. (Unless, of course, these sets accidentally fall in $\bar{\mathcal{B}}_I$. See Problem 3 for an important set which is not in $\bar{\mathcal{B}}_I$.)
But a practical approach that seems reasonable is: Let

$$A_\epsilon = \Big\{\inf_{t \in I} |X(t)| < \epsilon\Big\}$$

and hope that a.s. $A_1 = \lim_{\epsilon \downarrow 0} A_\epsilon$. To compute $P(A_\epsilon)$, for $I = [0, 1]$ say, compute

$$P_n = P\Big(\inf_{k \leq n} |X(k/n)| < \epsilon\Big)$$

and define $P(A_\epsilon) = \lim_n P_n$. Note that $\{\inf_{k \leq n} |X(k/n)| < \epsilon\} \in \mathcal{F}$, so its probability is well-defined and computable from the distribution functions. Finally, define $P(A_1) = \lim_{\epsilon \downarrow 0} P(A_\epsilon)$. This method of approximation is appealing. How to get it to make sense? We take this up in the next section.
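As an illustration of the grid approximation $P_n$, here is a rough Monte Carlo sketch that is not part of the original text. It assumes normalized Brownian motion and, because $X(0) = 0$ would make the event trivial on $[0, 1]$, it uses the interval $[1, 2]$ instead; the function name, seed, and sample sizes are likewise assumptions.

```python
import math
import random

def estimate_p_n(eps, n, start=1.0, n_paths=2000, seed=2):
    """Estimate P(min over the grid start + k/n, k = 0..n, of |X| < eps)
    for normalized Brownian motion, by simulating independent increments."""
    rng = random.Random(seed)
    step_sd = math.sqrt(1.0 / n)
    hits = 0
    for _ in range(n_paths):
        x = rng.gauss(0.0, math.sqrt(start))  # X(start) is N(0, start)
        smallest = abs(x)
        for _ in range(n):
            x += rng.gauss(0.0, step_sd)      # increment over a grid cell
            smallest = min(smallest, abs(x))
        if smallest < eps:
            hits += 1
    return hits / n_paths

if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, estimate_p_n(0.1, n))
```

Only finitely many coordinates enter each estimate, which is exactly why the event is measurable and its probability is determined by the finite-dimensional distributions.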


Problems 
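2. Prove for Brownian motion that the fields

$$\mathcal{F}(X(t),\ t \in [a, b]) \quad \text{and} \quad \mathcal{F}(X(t) - X(b),\ t \in [b, c]),$$

$0 \leq a < b < c < \infty$, are independent.
3. Let

$$A = \{x(\cdot) \in R^I;\ x(\cdot) \text{ is } \mathcal{B}_1(I) \text{ measurable}\}.$$

Show by considering $A$ and $A^c$, that $A$ is never in $\bar{\mathcal{B}}_I$ for any probability $P$ on $\mathcal{B}_I$.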


5. EXTENSION BY CONTINUITY 


We are going to insist that all the processes we deal with in this chapter have a 
very weak continuity property. 
Definition 12.15. Given a stochastic process $\{X(t)\}$, $t \in I$, say that it is continuous in probability if for every $t \in I$, whenever $t_n \to t$, then $X(t_n) \xrightarrow{P} X(t)$.

When is $X(t, \omega)$ a continuous function of $t$? The difficulty here again is that the set

$$\{\omega;\ X(t, \omega) \text{ continuous on } I\}$$

is not necessarily in $\mathcal{F}$. It certainly does not depend on only a countable number of coordinates. However, one way of getting around the problem is to take $T = \{t_j\}$ dense in $I$. The set

$$\{\omega;\ X(\cdot, \omega) \text{ uniformly continuous on } T\}$$

is in $\mathcal{F}$. To see this more clearly, for $h > 0$ define

$$U(h, \omega) = \sup_{|t_i - t_j| \leq h} |X(t_i, \omega) - X(t_j, \omega)|, \qquad t_i, t_j \in T.$$

The function $U(h, \omega)$ is the supremum over a countable set of random variables, hence is certainly a random variable. Furthermore, it decreases as $h$ decreases. If as $h \downarrow 0$, $U(h, \omega) \xrightarrow{\text{a.s.}} 0$, then for almost every $\omega$, $X(t, \omega)$ is a uniformly continuous function on $T$. Let $C \in \mathcal{F}$ be the set on which $U(h, \omega) \to 0$. Assume $P(C) = 1$. For $\omega \in C$, define $\tilde{X}(t, \omega)$ to be the unique continuous function on $I$ that coincides with $X(t, \omega)$ on $T$. For $t \in T$, obviously $\tilde{X}(t, \omega) = X(t, \omega)$. For $t \notin T$, $\omega \in C$, $\tilde{X}(t, \omega) = \lim_{t_j \to t} X(t_j, \omega)$. Define $\tilde{X}(t, \omega)$ to be anything continuous for $\omega \in C^c$, for example $\tilde{X}(t, \omega) = 0$, all $t \in I$, $\omega \in C^c$. But note that $X(t_j) \xrightarrow{\text{a.s.}} \tilde{X}(t)$ for $t_j \in T$, $t_j \to t$, while continuity in probability gives $X(t_j) \xrightarrow{P} X(t)$; together these imply that $\tilde{X}(t) = X(t)$ almost surely. When $I$ is an infinite interval, then this same construction works if for any finite interval $J \subset I$, there is probability one that $X(\cdot)$ is uniformly continuous on $T \cap J$. Thus we have proved


Theorem 12.16. If the process $\{X(t)\}$, $t \in I$, is continuous in probability, and if there is a countable set $T$ dense in $I$ such that

$$P(\omega;\ X(t, \omega) \text{ is uniformly continuous on } T \cap J) = 1$$

for every finite subinterval $J \subset I$, then there is a process $\tilde{X}(t, \omega)$ such that $\tilde{X}(t, \omega)$ is a continuous function of $t \in I$ for every fixed $\omega$, and for each $t$,

$$\tilde{X}(t, \omega) = X(t, \omega) \quad \text{a.s.}$$

The revised process $\{\tilde{X}(t, \omega)\}$ and the original process $\{X(t, \omega)\}$ have the same distribution, because for any countable $\{t_j\}$,

$$P(\omega;\ (\tilde{X}(t_1), \ldots) \neq (X(t_1), \ldots)) = 0.$$
Not only have the two processes the same distribution, so that they are indistinguishable probabilistically, but the $\{\tilde{X}(t)\}$ process is defined on the same probability space as the original process. The $\{\tilde{X}(t)\}$ process lends itself to all the computations and devices we wanted to use before. For example: for $I = [0, 1]$,

$$A_1 = \{\omega;\ \exists\, t \in I \text{ such that } \tilde{X}(t, \omega) = 0\},$$
$$A_\epsilon = \{\omega;\ \exists\, t \in I \text{ such that } |\tilde{X}(t, \omega)| < \epsilon\}.$$

It is certainly now true that

$$A_1 = \lim_{\epsilon \downarrow 0} A_\epsilon.$$

But take $A_{n,\epsilon}$ to be the set $A_{n,\epsilon} = \{\omega;\ \exists\, k \leq 2^n \text{ such that } |\tilde{X}(k/2^n)| < \epsilon\}$. Then

$$A_{n,\epsilon} \in \mathcal{F}, \qquad A_{n,\epsilon} \uparrow A_\epsilon,$$

which implies $A_\epsilon \in \mathcal{F}$, so, in turn, $A_1 \in \mathcal{F}$. Furthermore,

$$P(A_\epsilon) = \lim_n P(A_{n,\epsilon}).$$

Therefore, by slightly revising the original process, we arrive at a process having the same distribution, for which the reasonable approximation procedures we wish to use are valid and the various interesting sets are measurable.

Obviously not all interesting stochastic processes can be altered slightly so as to have all sample functions continuous. But the basic idea always is to pick and work with the smoothest possible version of the process.


Definition 12.17. Given two processes $\{X(t)\}$ and $\{\tilde{X}(t)\}$, $t \in I$, on the same probability space $(\Omega, \mathcal{F}, P)$. They will be called versions of each other if

$$P(\tilde{X}(t) \neq X(t)) = 0, \qquad \text{all } t \in I.$$


Problems 


4. Show that if a process $\{X(t)\}$, $t \in I$, is continuous in probability, then for any set $\{t_n\}$ dense in $I$, each $X(t)$ is measurable with respect to the completion of $\mathcal{F}(X(t_1), X(t_2), \ldots)$, or that each set of $\mathcal{F}(X(t), t \in I)$ differs from a set of $\mathcal{F}(X(t_1), \ldots)$ by a set of probability zero.

5. Conclude from the above that if $T_n \subset T_{n+1}$, $T_n \uparrow T$, $T_n$ finite subsets of $I$, $T$ dense in $I$, then for $J \subset I$, $A \in \mathcal{F}$,

$$P(A \mid X(t),\ t \in J) = \lim_n P(A \mid X(t),\ t \in J \cap T_n), \quad \text{a.s.}$$

6. If $X(t)$, $\tilde{X}(t)$ are versions of each other for $t \in I$, and if both processes have all sample paths continuous on $I$, show that

$$P(X(t) = \tilde{X}(t) \text{ for all } t \in I) = 1.$$

7. If $\{X(t)\}$, $t \in I$, is a process all of whose paths are continuous on $I$, then show that the function $X(t, \omega)$ defined on $I \times \Omega$ is measurable with respect to the product $\sigma$-field $\mathcal{B}_1(I) \times \mathcal{F}$. [For $I$ finite, let $I_1, \ldots, I_n$ be any partition of $I$, $t_k \in I_k$, and consider approximating $X(t)$ by the functions

$$\sum_k X(t_k)\chi_{I_k}(t).]$$


6. CONTINUITY OF BROWNIAN MOTION 


It is easy to check that the finite-dimensional distributions given by 12.5 are consistent. Hence there is a process $\{X(t)\}$, $t \geq 0$, fitting them.


Definition 12.18. Let $X(t)$ be a Brownian motion. If $\mu \neq 0$ it is said to be a Brownian motion with drift $\mu$. If $\mu = 0$, $\sigma^2 = 1$, it is called normalized Brownian motion, or simply Brownian motion.


Note that $(X(t) - \mu t)/\sigma$ is normalized. The most important single sample path property is contained in


Theorem 12.19. For any Brownian motion $X(t)$ there is a dense set $T$ in $[0, \infty)$ such that $X(t)$ is uniformly continuous on $T \cap [0, a]$, $a < \infty$, for almost every $\omega$.
In preparation for the proof of this, we need

Proposition 12.20. Let $T_0$ be any finite collection of points, $0 = t_0 < t_1 < \cdots < t_n = \tau$; then for $X(t)$ normalized Brownian motion

$$P\Big(\max_{t \in T_0} X(t) > x\Big) \leq 2P(X(\tau) > x),$$
$$P\Big(\max_{t \in T_0} |X(t)| > x\Big) \leq 2P(|X(\tau)| > x).$$

Proof. Denote $Y_k = X(t_k) - X(t_{k-1})$, and $j^* = \{\text{first } j \text{ such that } X(t_j) > x\}$. Then because $X(t_n) - X(t_j)$ has a distribution symmetric about the origin,

$$\tfrac12 P\Big(\max_{t \in T_0} X(t) > x\Big) \leq \sum_{j=1}^{n} P(j^* = j)\,P(X(t_n) - X(t_j) \geq 0).$$

Since $\{j^* = j\} \in \mathcal{F}(X(t_0), \ldots, X(t_j))$, then

$$\begin{aligned}
\tfrac12 P\Big(\max_{t \in T_0} X(t) > x\Big) &\leq \sum_{j=1}^{n} P(j^* = j,\ X(t_n) - X(t_j) \geq 0) \\
&\leq \sum_{j=1}^{n} P(j^* = j,\ X(t_n) > x) \leq P(X(\tau) > x).
\end{aligned}$$

For the second inequality, use

$$\Big\{\max_{t \in T_0} |X(t)| > x\Big\} = \Big\{\max_{t \in T_0} X(t) > x\Big\} \cup \Big\{\min_{t \in T_0} X(t) < -x\Big\}$$

and the fact that $-X(t)$ is normalized Brownian motion.
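The first inequality of 12.20 can be examined by simulation. The following sketch is an illustration only, with an evenly spaced grid, seed, and sample size chosen as assumptions: it estimates both $P(\max_{t \in T_0} X(t) > x)$ and $P(X(\tau) > x)$ from the same simulated paths.

```python
import math
import random

def max_inequality_check(x, n_steps=40, tau=1.0, n_paths=10000, seed=3):
    """Return Monte Carlo estimates of
    (P(max over grid of X(t) > x), P(X(tau) > x))
    for normalized Brownian motion on an evenly spaced grid of [0, tau]."""
    rng = random.Random(seed)
    step_sd = math.sqrt(tau / n_steps)
    exceed_max = 0
    exceed_end = 0
    for _ in range(n_paths):
        pos = 0.0           # X(t_0) = X(0) = 0
        running_max = 0.0   # maximum over the grid, including t_0
        for _ in range(n_steps):
            pos += rng.gauss(0.0, step_sd)
            running_max = max(running_max, pos)
        exceed_max += running_max > x
        exceed_end += pos > x
    return exceed_max / n_paths, exceed_end / n_paths

if __name__ == "__main__":
    p_max, p_end = max_inequality_check(1.0)
    print(p_max, 2 * p_end)  # the first number should not exceed the second by much
```

Since both numbers are estimates, the comparison holds only up to sampling error.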
Proof of 12.19. We show this for $a = 1$. Take $T_n = \{k/2^n;\ k = 0, \ldots, 2^n\}$, and $T = \bigcup_{1}^{\infty} T_n$. Define

$$U_n = \sup_{\substack{t_j, t_k \in T \\ |t_j - t_k| \leq 1/2^n}} |X(t_j) - X(t_k)|.$$

To show that $U_n \to 0$ a.s., since $U_n \downarrow$ it is sufficient to show that

$$\lim_n P(U_n > \delta) = 0.$$

Let $I_k = [k/2^n, (k + 1)/2^n]$, and

$$Y_k = \sup_{t \in I_k \cap T} |X(t) - X(k/2^n)|, \qquad k = 0, 1, \ldots, 2^n - 1.$$

By the triangle inequality,

$$U_n \leq 3 \max_k Y_k.$$

We show that $P(\max_k Y_k > \delta) \to 0$. Use

$$P\Big(\max_k Y_k > \delta\Big) = P\Big(\bigcup_k \{Y_k > \delta\}\Big) \leq \sum_{k=0}^{2^n - 1} P(Y_k > \delta).$$

The $Y_0, \ldots$ are identically distributed, so

$$P\Big(\max_k Y_k > \delta\Big) \leq 2^n P(Y_0 > \delta).$$

Note that

$$Y_0 = \lim_N \max_{t \in I_0 \cap T_N} |X(t)|.$$

By 12.20, for $N > n$,

$$P\Big(\max_{t \in I_0 \cap T_N} |X(t)| > \delta\Big) \leq 2P(|X(2^{-n})| > \delta),$$

hence

$$P(Y_0 > \delta) \leq 2P(|X(2^{-n})| > \delta).$$

Since Brownian motion satisfies $P(|X(\Delta t)| > \delta)/\Delta t \to 0$ as $\Delta t \to 0$, then

$$2^n P(|X(2^{-n})| > \delta) \to 0,$$

which proves the theorem.
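The last step can be made concrete. For normalized Brownian motion, $X(h)$ is $N(0, h)$, so $P(|X(h)| > \delta) = \operatorname{erfc}(\delta/\sqrt{2h})$. The sketch below (an illustration, with a hypothetical function name) evaluates $2^n P(|X(2^{-n})| > \delta)$ for increasing $n$.

```python
import math

def scaled_tail(n, delta):
    """Compute 2^n * P(|X(2^-n)| > delta) for normalized Brownian motion,
    using P(|N(0, h)| > delta) = erfc(delta / sqrt(2 h))."""
    h = 2.0 ** (-n)
    return (2.0 ** n) * math.erfc(delta / math.sqrt(2.0 * h))

if __name__ == "__main__":
    for n in range(0, 13, 2):
        print(n, scaled_tail(n, 0.5))
```

The product is not monotone for small $n$, but the Gaussian tail eventually decays far faster than $2^n$ grows.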


Corollary 12.21. There is a version of Brownian motion on $[0, \infty)$ such that all sample paths are continuous.


Henceforth, we assume that the Brownian motion we deal with has all 
sample paths continuous. 


Problems 


8. Prove that for $\{X(t)\}$ normalized Brownian motion on $[0, \infty)$, $P(X(n\delta) \in J \text{ i.o.}) = 1$ for all intervals $J$ such that $\|J\| > 0$, and fixed $\delta > 0$.
9. Define $\mathcal{C} = \bigcap_{t > 0} \mathcal{F}(X(\tau), \tau \geq t)$, or $A \in \mathcal{C}$ if $A \in \mathcal{F}(X(\tau), \tau \geq t)$ for all $t > 0$. Prove that $A \in \mathcal{C}$ implies $P(A) = 0$ or $1$. [Apply a generalized version of 3.50.]


7. AN ALTERNATIVE DEFINITION 


Normalized Brownian motion is completely specified by stating that it is a Gaussian process, i.e., all finite subsets $\{X(t_1), \ldots, X(t_n)\}$ have a joint normal distribution, $EX(t) = 0$ and covariance $\Gamma(s, t) = \min(s, t)$. Since all sample functions are continuous, to specify Brownian motion it would only be necessary to work with a countable subset of random variables $\{X(t)\}$, $t \in T$, and get the others as limits. This leads to the idea that with a proper choice of coordinates, $X(t)$ could be expanded in a countable coordinate system. Let $Y_1, Y_2, \ldots$ be independent and $N(0, 1)$. Let $\varphi_k(t)$ be defined on a closed interval $I$ such that $\sum_k |\varphi_k(t)|^2 < \infty$, all $t \in I$. Consider

$$Z(t) = \sum_k \varphi_k(t) Y_k.$$

Since

$$\sum_k E|\varphi_k(t) Y_k|^2 = \sum_k |\varphi_k(t)|^2 < \infty,$$

the sums converge a.s. for every $t \in I$, hence $Z(t)$ is well-defined for each $t$ except on a set of probability zero. Furthermore, $Z(t_1), \ldots, Z(t_n)$ is the limit in distribution of joint normal variables, so that $\{Z(t)\}$ is a Gaussian process.

Note $EZ(t) = 0$, and for the $Z(t)$ covariance

$$\Gamma(s, t) = \sum_k \varphi_k(s)\varphi_k(t), \qquad (s, t) \in I \times I. \tag{12.22}$$

Hence if the $\varphi_k(t)$ satisfy

$$\min(s, t) = \sum_k \varphi_k(s)\varphi_k(t), \qquad (s, t) \in I \times I,$$

then $Z(t)$ is normalized Brownian motion on $I$. I assert that on $I = [0, \pi]$,

$$\min(s, t) = \frac{ts}{\pi} + \frac{2}{\pi} \sum_{k \geq 1} \frac{\sin kt \, \sin ks}{k^2}. \tag{12.23}$$

One way to verify this is to define a function of $t$ on $[-\pi, \pi]$ for any $s \geq 0$, by

$$h_s(t) = \begin{cases} t, & |t| \leq s; \\ -s, & t < -s; \\ s, & t > s. \end{cases}$$

Denote the right-hand side of (12.23) by $g_s(t)$. The sum converges uniformly, hence $g_s(t)$ is continuous for all $s, t$. Simply check that for all integers $k$,

$$\int_{[-\pi, +\pi]} e^{ikt} h_s(t)\, dt = \int_{[-\pi, +\pi]} e^{ikt} g_s(t)\, dt,$$

and use the well-known fact that two continuous functions with the same Fourier coefficients are equal on $[-\pi, +\pi]$. Since $h_s(t) = \min(s, t)$ for $t \geq 0$, (12.23) results.
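Identity (12.23) is also easy to test numerically. The sketch below is an illustration, with the truncation level chosen as an assumption: it sums the first few thousand terms of the series and compares the result with $\min(s, t)$. The neglected tail is at most $(2/\pi)\sum_{k > K} k^{-2} < 2/(\pi K)$.

```python
import math

def min_series(s, t, terms=5000):
    """Partial sum of the right-hand side of (12.23):
    t*s/pi + (2/pi) * sum_{k=1}^{terms} sin(k t) sin(k s) / k^2."""
    total = sum(math.sin(k * t) * math.sin(k * s) / (k * k)
                for k in range(1, terms + 1))
    return t * s / math.pi + (2.0 / math.pi) * total

if __name__ == "__main__":
    for s, t in [(0.5, 2.0), (1.0, 1.0), (3.0, 0.2)]:
        print(s, t, min_series(s, t), min(s, t))
```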




Proposition 12.24. Let $Y_0, Y_1, \ldots$ be independent $N(0, 1)$; then

$$X(t) = \frac{t}{\sqrt{\pi}}\, Y_0 + \sqrt{\frac{2}{\pi}} \sum_{k=1}^{\infty} \frac{\sin kt}{k}\, Y_k$$

is normalized Brownian motion on $[0, \pi]$.


One way to prove the continuity of sample paths would be to define $X^{(n)}(t)$ as the $n$th partial sum in 12.24, and to show that for almost all $\omega$, the functions $x_n(t) = X^{(n)}(t, \omega)$ converge uniformly on $[0, \pi]$. This can be shown true, at least for a subsequence $X^{(n')}(t)$. See Ito and McKean [76, p. 22], for a proof along these lines.


8. VARIATION AND DIFFERENTIABILITY 


The Brownian motion paths are extremely badly behaved for continuous 
functions. Their more obvious indices of bad behavior are given in this 
section: they are nowhere differentiable, and consequently of unbounded 
variation in every interval. 


Theorem 12.25. Almost every Brownian path is nowhere differentiable. 


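Proof. We follow Dvoretski, Erdös, and Kakutani [42]. Fix $\beta > 0$, suppose that a function $x(t)$ has derivative $x'(s)$, $|x'(s)| < \beta$, at some point $s \in [0, 1]$; then there is an $n_0$ such that for $n > n_0$

$$|x(t) - x(s)| \leq 2\beta |t - s|, \quad \text{if } |t - s| \leq 2/n. \tag{12.26}$$

Let $x(\cdot)$ denote functions on $[0, 1]$,

$$A_n = \{x(\cdot);\ \exists\, s \text{ such that } |x(t) - x(s)| \leq 2\beta |t - s|, \text{ if } |t - s| \leq 2/n\}.$$

The $A_n$ increase with $n$, and the limit set $A$ includes the set of all sample paths on $[0, 1]$ having a derivative at any point which is less than $\beta$ in absolute value. If (12.26) holds, and we let $k$ be the largest integer such that $k/n \leq s$, the following is implied:

$$y_k = \max\left( \left|x\Big(\frac{k+2}{n}\Big) - x\Big(\frac{k+1}{n}\Big)\right|,\ \left|x\Big(\frac{k+1}{n}\Big) - x\Big(\frac{k}{n}\Big)\right|,\ \left|x\Big(\frac{k}{n}\Big) - x\Big(\frac{k-1}{n}\Big)\right| \right) \leq \frac{6\beta}{n}.$$

Therefore, if

$$B_n = \Big\{x(\cdot);\ \text{at least one } y_k \leq \frac{6\beta}{n}\Big\},$$

then $A_n \subset B_n$. Thus to show $P(A) = 0$, which implies the theorem, it is sufficient to get $\lim_n P(B_n) = 0$. But

$$B_n = \bigcup_{k=1}^{n-2} \Big\{x(\cdot);\ y_k \leq \frac{6\beta}{n}\Big\},$$

and, since the three increments in $y_k$ are independent and each is distributed as $X(1/n)$,

$$P(B_n) \leq \sum_{k=1}^{n-2} P\Big(y_k \leq \frac{6\beta}{n}\Big) \leq n\, P\Big(\Big|X\Big(\frac{1}{n}\Big)\Big| \leq \frac{6\beta}{n}\Big)^3 = n \left( \sqrt{\frac{n}{2\pi}} \int_{-6\beta/n}^{6\beta/n} e^{-n x^2/2}\, dx \right)^3.$$

Substitute $\sqrt{n}\, x = y$. Then

$$P(B_n) \leq n \left( \frac{1}{\sqrt{2\pi}} \int_{-6\beta/\sqrt{n}}^{6\beta/\sqrt{n}} e^{-y^2/2}\, dy \right)^3 \to 0.$$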

Corollary 12.27. Almost every sample path of $X(t)$ has infinite variation on every finite interval.


Proof. If a sample function $X(t, \omega)$ has bounded variation on $I$, then it has a derivative existing almost everywhere on $I$.


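A further result gives more information on the size of the oscillations of $X(t)$. Since $E|X(t + \Delta t) - X(t)|^2 = \Delta t$, as a rough estimate we would guess that $|X(t + \Delta t) - X(t)| \simeq \sqrt{\Delta t}$. Then for any fine partition $t_0, \ldots, t_n$ of the interval $[t, t + \tau]$,

$$\sum_{k=1}^{n} |X(t_k) - X(t_{k-1})|^2 \simeq \tau.$$

The result of the following theorem not only verifies this, but makes it surprisingly precise.


Theorem 12.28. Let the partitions $\mathcal{T}_n$ of $[t, t + \tau]$, $\mathcal{T}_n = (t_0^{(n)}, \ldots, t_{N_n}^{(n)})$, $\|\mathcal{T}_n\| = \sup_k |t_k^{(n)} - t_{k-1}^{(n)}|$, satisfy $\|\mathcal{T}_n\| \to 0$. Then

$$S_n = \sum_k |X(t_k^{(n)}) - X(t_{k-1}^{(n)})|^2 \to \tau \quad \text{in mean square}.$$

If $\sum_{1}^{\infty} \|\mathcal{T}_n\| < \infty$, then $S_n \xrightarrow{\text{a.s.}} \tau$.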
Proof. Assume $t = t_0^{(n)}$, $t + \tau = t_{N_n}^{(n)}$, otherwise do some slight modification. Then $\tau = \sum_k (t_k - t_{k-1})$ (dropping the superscript), and

$$\begin{aligned}
S_n - \tau &= \sum_k \big[(X(t_k) - X(t_{k-1}))^2 - (t_k - t_{k-1})\big] \\
&= \sum_k \big[(X(t_k) - X(t_{k-1}))^2 - E(X(t_k) - X(t_{k-1}))^2\big].
\end{aligned}$$

The summands are independent, with zero means. Hence

$$E(S_n - \tau)^2 = \sum_k E\big[(X(t_k) - X(t_{k-1}))^2 - (t_k - t_{k-1})\big]^2.$$

$(X(t_k) - X(t_{k-1}))^2/(t_k - t_{k-1})$ has the distribution of $X^2$, where $X$ is $N(0, 1)$. So

$$E(S_n - \tau)^2 = E(X^2 - 1)^2 \sum_k (t_k - t_{k-1})^2 \leq E(X^2 - 1)^2\, \|\mathcal{T}_n\|\, \tau,$$

proving convergence in mean square. If $\sum \|\mathcal{T}_n\| < \infty$, then use the Borel-Cantelli lemma plus the Chebyshev inequality.


Theorem 12.28 holds more generally, with $S_n \xrightarrow{\text{a.s.}} \tau$ for any sequence $\mathcal{T}_n$ of partitions such that $\|\mathcal{T}_n\| \to 0$ and the $\mathcal{T}_n$ are successive refinements. (See Doob [39, pp. 395 ff.].)
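A short simulation makes the statement of 12.28 tangible. In this sketch (an illustration; the evenly spaced partition, seed, and function name are assumptions) the increments over a partition of $[t, t + \tau]$ into $n$ equal cells are independent $N(0, \tau/n)$, and the sum of their squares should be close to $\tau$, with variance $2\tau^2/n$.

```python
import math
import random

def quadratic_variation(tau, n_intervals, rng):
    """Sum of squared increments of normalized Brownian motion over an
    evenly spaced partition of an interval of length tau."""
    step_sd = math.sqrt(tau / n_intervals)
    return sum(rng.gauss(0.0, step_sd) ** 2 for _ in range(n_intervals))

if __name__ == "__main__":
    rng = random.Random(4)
    for n in (10, 1000, 100000):
        print(n, quadratic_variation(2.0, n, rng))  # approaches tau = 2.0
```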


9. LAW OF THE ITERATED LOGARITHM 


Now for one of the most precise and well-known theorems regarding 
oscillations of a Brownian motion. It has gone through many refinements 
and generalizations since its proof by Khintchine in 1924. 


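Theorem 12.29. For normalized Brownian motion

$$\varlimsup_{t \downarrow 0} \frac{X(t)}{\sqrt{2t \log(\log 1/t)}} = 1 \quad \text{a.s.}$$


Proof. I follow essentially Lévy's proof. Let $\varphi(t) = \sqrt{2t \log(\log 1/t)}$.
1) For any $\delta > 0$,

$$P\Big(\varlimsup_{t \downarrow 0} (X(t)/\varphi(t)) > 1 + \delta\Big) = 0.$$

Proof of (1). Take $q$ any number in $(0, 1)$, put $t_n = q^n$.
The plan is to show that if $C_n$ is the event

$$\{X(t) > (1 + \delta)\varphi(t) \text{ for at least one } t \in [t_{n+1}, t_n]\}$$

then $P(C_n \text{ i.o.}) = 0$. Define $M(\tau) = \sup_{t \leq \tau} X(t)$ and use

$$C_n \subset \{M(t_n) > (1 + \delta)\varphi(t_{n+1})\},$$

valid since $\varphi(t)$ is increasing in $t$. Use the estimates gotten from taking limits in 12.20:

$$\begin{aligned}
P(M(\tau) > x\sqrt{\tau}) &\leq 2P(X(\tau) > x\sqrt{\tau}) = \sqrt{\frac{2}{\pi}} \int_x^\infty e^{-z^2/2}\, dz \\
&\leq \sqrt{\frac{2}{\pi}}\, \frac{1}{x} \int_x^\infty z e^{-z^2/2}\, dz = \sqrt{\frac{2}{\pi}}\, \frac{e^{-x^2/2}}{x}.
\end{aligned}$$

Hence, letting $x_n = (1 + \delta)\varphi(t_{n+1})/\sqrt{t_n}$,

$$P(M(t_n) > x_n\sqrt{t_n}) \leq \sqrt{\frac{2}{\pi}}\, \frac{e^{-x_n^2/2}}{x_n},$$

and since

$$x_n = (1 + \delta)\sqrt{2q \log[(n + 1)\log 1/q]} = \sqrt{2\lambda \log[(n + 1)\log 1/q]},$$

where $\lambda = q(1 + \delta)^2$,

$$P(M(t_n) > (1 + \delta)\varphi(t_{n+1})) \leq \frac{c}{(n + 1)^{\lambda}\sqrt{\log(n + 1)}} \tag{12.30}$$

for all sufficiently large $n$, with a constant $c$ depending only on $q$ and $\delta$.

For any $\delta$, select $q$ so that $q(1 + \delta)^2 > 1$. Then the right-hand side of (12.30) is a term of a convergent sum and the first assertion is proved.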
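The Gaussian tail estimate used in part (1), $\int_x^\infty e^{-z^2/2}\,dz \leq e^{-x^2/2}/x$, can be checked against the exact tail. The sketch below is illustrative (the function names are hypothetical): it compares $P(N(0,1) > x)$, computed from the complementary error function, with the bound $e^{-x^2/2}/(x\sqrt{2\pi})$.

```python
import math

def normal_tail(x):
    """P(N(0, 1) > x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def tail_bound(x):
    """Upper bound exp(-x^2 / 2) / (x * sqrt(2 pi)), valid for x > 0."""
    return math.exp(-x * x / 2.0) / (x * math.sqrt(2.0 * math.pi))

if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0, 4.0, 6.0):
        print(x, normal_tail(x), tail_bound(x), normal_tail(x) / tail_bound(x))
```

The ratio of the two columns tends to 1 as $x$ grows, which is the asymptotic equivalence used again in part (2) of the proof.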


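2) For any $\delta > 0$,

$$P\Big(\varlimsup_{t \downarrow 0} (X(t)/\varphi(t)) > 1 - \delta\Big) = 1.$$

Proof of (2). Take $q$ again in $(0, 1)$, $t_n = q^n$, let $Z_n = X(t_n) - X(t_{n+1})$. The $Z_n$ are independent. Suppose we could show that for $\epsilon > 0$,

$$P(Z_n > (1 - \epsilon)\varphi(t_n) \text{ i.o.}) = 1.$$

This would be easy in principle, because the independence of the $Z_n$ allows the converse half of the Borel-Cantelli lemma to be used. On the other hand, from part one of this proof, because the processes $\{X(t)\}$ and $\{-X(t)\}$ have the same distribution,

$$P(X(t_{n+1}) < -(1 + \epsilon)\varphi(t_{n+1}) \text{ i.o.}) = 0$$

or

$$X(t_{n+1}) \geq -(1 + \epsilon)\varphi(t_{n+1})$$

holds for all $n$ sufficiently large. From $X(t_n) = Z_n + X(t_{n+1})$ it follows that infinitely often

$$\begin{aligned}
X(t_n) &> (1 - \epsilon)\varphi(t_n) - (1 + \epsilon)\varphi(t_{n+1}) \\
&= \varphi(t_n)\Big[1 - \epsilon - (1 + \epsilon)\frac{\varphi(t_{n+1})}{\varphi(t_n)}\Big], \quad \text{a.s.}
\end{aligned}$$

Note that $\varphi(t_{n+1})/\varphi(t_n) \to \sqrt{q}$. Therefore, if we take $\epsilon$, $q$ so small that

$$1 - \epsilon - (1 + \epsilon)\sqrt{q} > 1 - \delta,$$

the second part would be established. So now, we start estimating:

$$P\big(Z_n > x\sqrt{t_n - t_{n+1}}\big) = \frac{1}{\sqrt{2\pi}} \int_x^\infty e^{-z^2/2}\, dz \sim \frac{1}{\sqrt{2\pi}\, x}\, e^{-x^2/2}$$

as $x \to \infty$. Let

$$x_n = \frac{(1 - \epsilon)\varphi(t_n)}{\sqrt{t_n - t_{n+1}}} = \frac{1 - \epsilon}{\sqrt{1 - q}} \sqrt{2 \log(n \log 1/q)} = \sqrt{\alpha \log(n \log 1/q)}, \qquad \alpha = \frac{2(1 - \epsilon)^2}{1 - q}.$$

Then

$$P(Z_n > (1 - \epsilon)\varphi(t_n)) \sim \frac{1}{\sqrt{2\pi}}\, \frac{1}{(n \log 1/q)^{\alpha/2} \sqrt{\alpha \log(n \log 1/q)}}.$$

By taking $q$ even smaller, if necessary, we can get $\alpha < 2$. The right-hand side is then a term of a divergent series and the proof is complete. Q.E.D.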


10. BEHAVIOR AT t = ∞

Let $Y_k = X(k) - X(k - 1)$. The $Y_k$ are independent $N(0, 1)$ variables, and $X(n) = Y_1 + \cdots + Y_n$ is the sum of independent, identically distributed random variables. Thus $X(t)$ for $t$ large has the magnitude properties of successive sums $S_n$. In particular,


Proposition 12.31

$$\frac{X(t)}{t} \xrightarrow{\text{a.s.}} 0 \quad \text{as } t \to \infty.$$

Proof. Since $EY_k = 0$, we can use the strong law of large numbers to get $X(n)/n \xrightarrow{\text{a.s.}} 0$. Let

$$Z_k = \max_{0 \leq t \leq 1} |X(k + t) - X(k)|.$$

For $t \in [k, k + 1]$,

$$\left|\frac{X(t)}{t} - \frac{X(k)}{k}\right| \leq \left|\frac{1}{t} - \frac{1}{k}\right| |X(k)| + \frac{1}{t}\,|X(t) - X(k)| \leq \frac{1}{k}\left|\frac{X(k)}{k}\right| + \frac{Z_k}{k}.$$

The first term $\xrightarrow{\text{a.s.}} 0$. $Z_k$ has the same distribution as $\max_{0 \leq t \leq 1} |X(t)|$. By 12.20, $EZ_k < \infty$. Now use Problem 10, Chapter 3, to conclude that

$$\frac{Z_k}{k} \xrightarrow{\text{a.s.}} 0.$$

This is the straightforward approach. There is another way which is surprising, because it essentially reduces behavior for $t \to \infty$ to behavior for $t \downarrow 0$. Define

$$X^{(1)}(t) = \begin{cases} tX(1/t), & t > 0, \\ 0, & t = 0. \end{cases}$$

Proposition 12.32. $X^{(1)}(t)$ is normalized Brownian motion on $[0, \infty)$.


Proof. Certainly $X^{(1)}(t)$ is Gaussian with zero mean. Also,

$$EX^{(1)}(t)X^{(1)}(s) = ts \min\Big(\frac{1}{t}, \frac{1}{s}\Big) = \min(s, t).$$
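The covariance identity in this proof is a one-line computation; written out for the case $0 < s \leq t$ (the other case is symmetric), it reads as follows. This worked step is an added illustration, not part of the original proof.

```latex
% Assume 0 < s <= t, so that 1/t <= 1/s and min(1/t, 1/s) = 1/t.
\begin{aligned}
E X^{(1)}(t) X^{(1)}(s)
  &= E\big[\, tX(1/t) \cdot sX(1/s) \,\big]
   = ts \, \Gamma(1/t, 1/s) \\
  &= ts \min\Big(\frac{1}{t}, \frac{1}{s}\Big)
   = ts \cdot \frac{1}{t}
   = s
   = \min(s, t).
\end{aligned}
```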


Now to prove 12.31 another way. The statement $X(t)/t \to 0$ as $t \to \infty$ translates into $tX(1/t) \to 0$ a.s. as $t \to 0$. So 12.31 is equivalent to proving that $X^{(1)}(t) \to 0$ a.s. as $t \to 0$. If $X^{(1)}(t)$ is a version of Brownian motion with all paths continuous on $[0, \infty)$, then trivially, $X^{(1)}(t) \to 0$ a.s. at the origin. However, the continuity of $X(t)$ on $[0, \infty)$ gives us only that the paths of $X^{(1)}(t)$ are continuous on $(0, \infty)$. Take a version $\tilde{X}^{(1)}(t)$ of $X^{(1)}(t)$ such that all paths of $\tilde{X}^{(1)}(t)$ are continuous on $[0, \infty)$. By Problem 5, almost all paths of $\tilde{X}^{(1)}(t)$ and $X^{(1)}(t)$ coincide on $(0, \infty)$. Since $\tilde{X}^{(1)}(t) \to 0$ as $t \to 0$, this is sufficient.

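By using this inversion on the law of the iterated logarithm we get


Corollary 12.33

$$\varlimsup_{t \to \infty} \frac{X(t)}{\sqrt{2t \log(\log t)}} = 1 \quad \text{a.s.}$$

Since $-X(t)$ is also Brownian motion,

$$\varliminf_{t \to \infty} \frac{X(t)}{\sqrt{2t \log(\log t)}} = -1 \quad \text{a.s.}$$

Therefore,

$$\varlimsup_{t \to \infty} \frac{|X(t)|}{\sqrt{2t \log(\log t)}} = 1 \quad \text{a.s.}$$

The similar versions of 12.29 hold as $t \to 0$; for instance,

$$\varliminf_{t \downarrow 0} \frac{X(t)}{\sqrt{2t \log|\log t|}} = -1 \quad \text{a.s.} \tag{12.34}$$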
11. THE ZEROS OF X(t)


Look at the set $T(\omega)$ of zeros of $X(t, \omega)$ in the interval $[0, 1]$. For any continuous function, the zero set is closed. By (12.29) and (12.34), $T(\omega)$ is an infinite set a.s. Furthermore the Lebesgue measure of $T(\omega)$ is a.s. zero, because $l(T(\omega)) = l(t;\ X(t) = 0) = \int_0^1 \chi_{\{0\}}(X(t))\, dt$, so

$$E\,l(T(\omega)) = E \int_0^1 \chi_{\{0\}}(X(t))\, dt = \int_0^1 P(X(t) = 0)\, dt = 0,$$

where the interchange of $E$ and $\int_0^1 dt$ is justified by the joint measurability of $X(t, \omega)$, hence of $\chi_{\{0\}}(X(t, \omega))$. (See Problem 7.)


Theorem 12.35. For almost all $\omega$, $T(\omega)$ is a closed, perfect set of Lebesgue measure zero (therefore, noncountable).


Proof. The remaining part is to prove that $T(\omega)$ has no isolated points. The idea here is that every time $X(t)$ hits zero, it is like starting all over again and the law of the iterated logarithm guarantees a clustering of zeros starting from that point. For almost all paths, the point $t = 0$ is a limit point of zeros of $X(t)$ from the right. For any point $a \geq 0$, let $t^*$ be the position of the first zero of $X(t)$ following $t = a$, that is,

$$t^* = \inf\{t;\ X(t) = 0,\ t \geq a\}.$$

Look at the process $X^{(1)}(t) = X(t + t^*) - X(t^*)$. This is just looking at the Brownian process as though it started afresh at the time $t^*$. Heuristically, what happens up to time $t^*$ depends only on the process up to that time; starting over again at $t^*$ should give a process that looks exactly like Brownian motion. If this argument can be made rigorous, then the set $C_a$ of sample paths such that $t^*$ is a limit point of zeros from the right has probability one. The intersection of $C_a$ over all rational $a \geq 0$ has probability one, also. Therefore almost every sample path has the property that the first zero following any rational is a limit point of zeros from the right. This precludes the existence of any isolated zero. Therefore, the theorem is proved except for the assertion,

$$X(t + t^*) - X(t^*) \text{ is a Brownian motion.} \tag{12.36}$$

The truth of (12.36) and its generalization are established in the next section. Suppose that it holds for more general random starting times. Then we could use this to prove


Corollary 12.37. For any value $a$, the set $T^a = \{t;\ X(t) = a,\ 0 \leq t \leq 1\}$ is, for almost all $\omega$, either empty or a perfect closed set of Lebesgue measure zero.

Proof. Let $t^*$ be the first $t$ such that $X(t) = a$, that is, $t^* = \inf\{t;\ X(t) = a\}$. If $t^* > 1$, then $T^a$ is empty. The set $\{t^* = 1\} \subset \{X(1) = a\}$ has probability zero. If $t^* < 1$, consider the process $X(t + t^*) - X(t^*)$ as starting out at the random time $t^*$. If this is Brownian motion, the zero set is perfect. But the zero set for this process is

$$\begin{aligned}
T &= \{t;\ X(t + t^*) - X(t^*) = 0,\ 0 \leq t \leq 1\} \\
&= \{t;\ X(t + t^*) = a,\ 0 \leq t \leq 1\}.
\end{aligned}$$

Hence $T^a$, which is the translate of $T$ by $t^*$ intersected with $[0, 1]$, is perfect a.s.


12. THE STRONG MARKOV PROPERTY 


The last item needed to complete and round out this study of sample path properties is a formulation and proof of the statement: At a certain time $t^*$, where $t^*$ depends only on the Brownian path up to time $t^*$, consider the motion as starting at $t^*$; that is, look at $X^{(1)}(t) = X(t + t^*) - X(t^*)$. Then $X^{(1)}(t)$ is Brownian motion and is independent of the path of the particle up to time $t^*$. Start with the observation that for $\tau \geq 0$, fixed,

$$X^{(1)}(t) = X(t + \tau) - X(\tau), \qquad t \geq 0 \tag{12.38}$$

is Brownian motion and is independent of $\mathcal{F}(X(s), s \leq \tau)$. Now, to have any of this make sense, we need:


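Proposition 12.39. If $t^* \geq 0$ is a random variable so is $X(t^*)$.


Proof. For any $n > 0$, let

$$A_k = \Big\{\frac{k - 1}{n} \leq t^* < \frac{k}{n}\Big\},$$

$$X^{(n)}(t^*) = \sum_{k=1}^{\infty} X\Big(\frac{k}{n}\Big)\chi_{A_k}.$$

$X^{(n)}(t^*)$ is a random variable. On the set $\{t^* \leq N\}$,

$$|X^{(n)}(t^*) - X(t^*)| \leq \sup_{\substack{0 \leq t \leq N \\ 0 \leq \Delta \leq 1/n}} |X(t + \Delta) - X(t)|.$$

The right-hand side $\to 0$ by the uniform continuity of the paths on finite intervals, so $X^{(n)}(t^*) \to X(t^*)$ everywhere.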


Next, it is necessary to formulate the statement that the value of $t^*$ depends only on the past of the process up to time $t^*$.


Definition 12.40. For any process $\{X(t)\}$ a random variable $t^* \geq 0$ will be called a stopping time if for every $t \geq 0$,

$$\{t^* \leq t\} \in \mathcal{F}(X(\tau), \tau \leq t).$$

The last step is to give meaning to "the part of the process up to time $t^*$." Look at an example:

$$t^* = \inf\{t;\ X(t) \geq 1\},$$

so $t^*$ is the first time that $X(t)$ hits the point $x = 1$. It can be shown quickly that $t^*$ is a stopping time. Look at sets depending on $X(t)$, $0 \leq t \leq t^*$; for example,

$$B = \Big\{\omega;\ \inf_{t \leq t^*} X(t) < -1\Big\}.$$

Note that for any $t > 0$, $B \cap \{t^* \leq t\}$ depends only on the behavior of the sample path on $[0, t]$, that is,

$$B \cap \{t^* \leq t\} \in \mathcal{F}(X(\tau), \tau \leq t).$$

Generalizing from this we get


Definition 12.41. The $\sigma$-field of events $B \in \mathcal{F}$ such that for every $t \geq 0$, $B \cap \{t^* \leq t\} \in \mathcal{F}(X(\tau), \tau \leq t)$ is called the $\sigma$-field generated by the process up to time $t^*$ and denoted by $\mathcal{F}(X(t), t \leq t^*)$.


This is all we need:

Theorem 12.42. Let $t^*$ be a stopping time; then

$$X^{(1)}(t) = X(t + t^*) - X(t^*), \qquad t \geq 0,$$

is normalized Brownian motion and $\mathcal{F}(X^{(1)}(t), t \geq 0)$ is independent of $\mathcal{F}(X(t), t \leq t^*)$.


Proof. If $t^*$ takes on only a countable number of values $\{\tau_k\}$, then 12.42 is quick. For example, if

$$A_1, \ldots, A_j \in \mathcal{B}_1, \qquad t_1, \ldots, t_j \geq 0, \qquad \text{and } B \in \mathcal{F}(X(t), t \leq t^*),$$

then

$$P(X^{(1)}(t_1) \in A_1, \ldots, X^{(1)}(t_j) \in A_j, B) = \sum_k P(X^{(1)}(t_1) \in A_1, \ldots, X^{(1)}(t_j) \in A_j,\ t^* = \tau_k,\ B).$$

Now,

$$\{X^{(1)}(t_1) \in A_1, \ldots,\ t^* = \tau_k,\ B\} = \{X(t_1 + \tau_k) - X(\tau_k) \in A_1, \ldots,\ t^* = \tau_k,\ B\}.$$

Furthermore,

$$\{t^* = \tau_k\} \cap B = \{t^* = \tau_k\} \cap (\{t^* \leq \tau_k\} \cap B),$$

so is in $\mathcal{F}(X(t), t \leq \tau_k)$. By (12.38), then,

$$\begin{aligned}
P(X^{(1)}(t_1) \in A_1, \ldots,\ t^* = \tau_k,\ B)
&= P(X(t_1 + \tau_k) - X(\tau_k) \in A_1, \ldots, X(t_j + \tau_k) - X(\tau_k) \in A_j) \cdot P(t^* = \tau_k, B) \\
&= P(X(t_1) \in A_1, \ldots, X(t_j) \in A_j)\,P(t^* = \tau_k, B).
\end{aligned}$$

Summing over $k$, we find that

$$P(X^{(1)}(t_1) \in A_1, \ldots, X^{(1)}(t_j) \in A_j, B) = P(X(t_1) \in A_1, \ldots, X(t_j) \in A_j)\,P(B). \tag{12.43}$$

This extends immediately to the statement of the theorem. In the general case, we approximate $t^*$ by a discrete stopping variable. Define

$$t_n^* = \begin{cases} \dfrac{k}{n}, & \dfrac{k - 1}{n} < t^* \leq \dfrac{k}{n},\ k > 1, \\[2mm] \dfrac{1}{n}, & 0 \leq t^* \leq \dfrac{1}{n},\ k = 1. \end{cases}$$

Then $t_n^*$ is a stopping time because for $k/n \leq t < (k + 1)/n$,

$$\{t_n^* \leq t\} = \{t^* \leq k/n\} \in \mathcal{F}(X(s), s \leq k/n).$$

But $\mathcal{F}(X(s), s \leq k/n) \subset \mathcal{F}(X(s), s \leq t)$. Also, I assert that

$$B \in \mathcal{F}(X(t), t \leq t^*) \ \Rightarrow\ B \in \mathcal{F}(X(t), t \leq t_n^*),$$

the latter because for $k/n \leq t < (k + 1)/n$,

$$B \cap \{t_n^* \leq t\} = B \cap \{t^* \leq k/n\} \in \mathcal{F}(X(s), s \leq t).$$

Let $X_n^{(1)}(t) = X(t + t_n^*) - X(t_n^*)$. By (12.43), for $B \in \mathcal{F}(X(t), t \leq t^*)$,

$$P(X_n^{(1)}(t_1) < x_1, \ldots, X_n^{(1)}(t_j) < x_j, B) = P(X(t_1) < x_1, \ldots, X(t_j) < x_j)\,P(B).$$

But, by the path continuity of $X(t)$, $X_n^{(1)}(t) \to X^{(1)}(t)$ for every $\omega$, $t$. This implies that at every point $(x_1, \ldots, x_j)$ which is a continuity point of the distribution function of $X^{(1)}(t_1), \ldots, X^{(1)}(t_j)$,

$$P(X^{(1)}(t_1) < x_1, \ldots, X^{(1)}(t_j) < x_j, B) = P(X(t_1) < x_1, \ldots, X(t_j) < x_j)\,P(B).$$

This is enough to ensure that equality holds for all $(x_1, \ldots, x_j)$. Extension now proves the theorem.
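The discrete approximation $t_n^*$ used in the proof is simply $t^*$ rounded up to the grid $\{1/n, 2/n, \ldots\}$. The sketch below illustrates that rounding; the function name is hypothetical, and floating-point rounding can matter when $t^* n$ is extremely close to an integer.

```python
import math

def discretize_stopping_time(t_star, n):
    """Return t_n^* = k/n, where (k-1)/n < t_star <= k/n for k > 1,
    and t_n^* = 1/n when 0 <= t_star <= 1/n."""
    if t_star < 0:
        raise ValueError("a stopping time must be nonnegative")
    k = max(1, math.ceil(t_star * n))  # round up to the grid, never below 1/n
    return k / n

if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, discretize_stopping_time(0.3, n))
```

By construction $t^* \leq t_n^* \leq t^* + 1/n$, which is what lets path continuity carry (12.43) over to the limit.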
Problem 10. Prove that the variables

$$t_1^* = \inf\{t;\ X(t) = 0,\ t \geq a\},$$
$$t_2^* = \inf\{t;\ X(t) = b\},$$

are stopping times for Brownian motion.


NOTES 


The motion of small particles suspended in water was noticed and de- 
scribed by Brown in 1828. The mathematical formulation and study was 
initiated by Bachelier [1, 1900], and Einstein [45, 1905], and carried on 
extensively from that time by both physicists and mathematicians. But 
rigorous discussion of sample path properties was not started until 1923 when 
Wiener [141] proved path continuity. Wiener also deduced the orthogonal 
expansion (12.24) in 1924 [142]. 

A good source for many of the deeper properties of Brownian motion
is Lévy's books [103, 105], and the recent book [76] by Ito and McKean.
A very interesting collection of articles that includes many references to
earlier works and gives a number of different ways of looking at Brownian
motion has been compiled by Wax [139]. The article by Dvoretzky, Erdős,
and Kakutani [42] gives the further puzzling property that no sample paths
have any "points of increase."

The fact of nondifferentiability of the sample paths was discovered by 
Paley, Wiener, and Zygmund [115, 1933]. The law of the iterated logarithm 
for Brownian motion was proved by Khintchine [88, 1933]. The properties of
the zero sets of its paths were stated by Lévy, who seemed to assume the truth
of 12.42. This latter property was stated and proved by Hunt [74, 1956].
David Freedman suggested the proof given that no zeros of X(t) are isolated. 


CHAPTER 13 


INVARIANCE THEOREMS 


1. INTRODUCTION 


Let $S_1, S_2, \ldots$ be a player's total winnings in a fair coin-tossing game. A
question leading to the famous arc sine theorem is: Let $N_n$ be the number
of times that the player is ahead in the first $n$ games,
$$N_n = \{\text{number of } k;\ k \le n,\ S_k > 0\}.$$

The proportion of the time that the player is ahead in $n$ games is $N_n/n = W_n$.
Does a limiting distribution exist for $W_n$, and if so, what is it?

Reason this way: Define $Z^{(n)}(t) = S_{[nt]}$ as $t$ ranges over the values
$0 \le t \le 1$. Denote Lebesgue measure by $l$; then
$$W_n = l\{t;\ Z^{(n)}(t) > 0,\ 0 \le t \le 1\}.$$

Now $Z^{(n)}(t) = S_{[nt]}$ does not converge to anything in any sense, but recall
from Section 2 of the last chapter that the processes $X^{(n)}(t) = S_{[nt]}/\sqrt{n}$
have all finite dimensional distribution functions converging to those of
normalized Brownian motion $X(t)$ as $n \to \infty$. We denote this by
$X^{(n)}(\cdot) \xrightarrow{\mathcal{D}} X(\cdot)$. But $W_n$ can also be written as
$$W_n = l\{t;\ X^{(n)}(t) > 0,\ 0 \le t \le 1\}.$$


Of course, the big transition that we would like to make here would be to
define
$$W = l\{t;\ X(t) > 0,\ 0 \le t \le 1\}$$
so that $W$ is just the proportion of time that a Brownian particle stays in the
positive axis during $[0, 1]$, and then conclude that
$$W_n \xrightarrow{\mathcal{D}} W.$$

The general truth of an assertion like this would be a profound generalization
of the central limit theorem. The transition from the obvious application
of the central limit theorem to conclude that $X^{(n)}(\cdot) \xrightarrow{\mathcal{D}} X(\cdot)$ to get to
$W_n \xrightarrow{\mathcal{D}} W$ is neither obvious nor easy. But some general theorems of this
kind would give an enormous number of limit theorems. For example, if
we let
$$M_n = \max(|S_1|, \ldots, |S_n|),$$
is it possible to find constants $A_n$ such that $M_n/A_n \xrightarrow{\mathcal{D}} M$, $M$ nondegenerate?
Again, write
$$M_n = \max_{t \in [0,1]} |S_{[nt]}| = \sqrt{n} \max_{t \in [0,1]} |X^{(n)}(t)|.$$
Let
$$M = \max_{t \in [0,1]} |X(t)|,$$
then apply our nonexistent theorem to conclude
$$M_n/\sqrt{n} \xrightarrow{\mathcal{D}} M.$$


This chapter will fill in the missing theorem. The core of the theorem is that
successive sums of independent, identically distributed random variables with
zero means and finite variances have the same distribution as Brownian
motion with a random time index. The idea is not difficult. Suppose
$S_1, S_2, \ldots$ form the symmetric random walk. Define $T_1$ as the first time
such that $|X(t)| = 1$. By symmetry $P(X(T_1) = 1) = P(X(T_1) = -1) = \frac{1}{2}$.
Define $T_2$ as the first time such that $|X(t + T_1) - X(T_1)| = 1$, and so on.
But $T_1$ is determined only by the behavior of the $X(t)$ motion up to time
$T_1$, so that intuitively one might hope that the $X(t + T_1) - X(T_1)$ process
would have the same distribution as Brownian motion, but be independent
of $X(t)$, $t \le T_1$. To make this sort of construction hold, the strong Markov
property is, of course, essential. But also, we need to know some more about
the first time that a Brownian motion exits from some interval around the
origin.


2. THE FIRST-EXIT DISTRIBUTION 


For some set $B \in \mathcal{B}_1$, let
$$t^*(B) = \inf\{t;\ X(t) \in B^c\}$$
be the first exit time of the Brownian motion from the set $B$. In particular,
the first time that the particle hits the point $\{a\}$ is identical with the first exit
time of the particle from $(-\infty, a)$ if $a > 0$, or from $(a, \infty)$ if $a < 0$, and
we denote it by $t_a^*$. Let $t^*(a, b)$ be the first exit time from the interval $(a, b)$.
The information we want is the probability that the first exit of the particle
from the interval $(a, b)$, $a < 0 < b$, occurs at the point $b$, and the value of
$Et^*(a, b)$, the expected time until exit.
For normalized Brownian motion $\{X(t)\}$, use the substitution $X(t) =
X(s) + (X(t) - X(s))$ to verify that for $s \le t$,
$$E(X(t) \mid X(\tau), \tau \le s) = X(s) \quad \text{a.s.}$$
$$E(X^2(t) - t \mid X(\tau), \tau \le s) = X^2(s) - s \quad \text{a.s.}$$
Hence the processes $\{X(t)\}$, $\{X^2(t) - t\}$ are martingales in the sense of


Definition 13.1. A process $\{Y(t)\}$, $t \in I$, is a martingale if for all $t$ in $I$,
$E|Y(t)| < \infty$, and for any $s, t$ in $I$, $s \le t$,
$$E(Y(t) \mid Y(\tau), \tau \le s) = Y(s) \quad \text{a.s.}$$
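
The two conditional expectations stated for $X(t)$ and $X^2(t) - t$ can be checked in a few lines. The sketch below uses only that the increment $X(t) - X(s)$ is independent of the past up to time $s$ and is normal with mean zero and variance $t - s$; the shorthand $\mathcal{F}_s$ for $\mathcal{F}(X(\tau), \tau \le s)$ is introduced here for convenience and is not the text's notation.

```latex
\begin{aligned}
E(X(t) \mid \mathcal{F}_s) &= X(s) + E(X(t) - X(s)) = X(s), \\
E(X^2(t) \mid \mathcal{F}_s) &= X^2(s) + 2X(s)\,E(X(t) - X(s)) + E(X(t) - X(s))^2 \\
  &= X^2(s) + (t - s).
\end{aligned}
```

Subtracting $t$ from both sides of the last line gives $E(X^2(t) - t \mid \mathcal{F}_s) = X^2(s) - s$.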


Suppose that for stopping times $t^*$ satisfying the appropriate integrability
conditions, the generalization of 5.31 holds as the statement $EY(t^*) = EY(0)$.
This would give, for a stopping time on Brownian motion,
$$EX(t^*) = 0, \quad E(X^2(t^*) - t^*) = 0. \tag{13.2}$$

These two equations would give the parameters we want. If we take
$t^* = t^*(a, b)$, the first equation of (13.2) becomes
$$aP(X(t^*) = a) + bP(X(t^*) = b) = 0.$$
Solving, we get
$$P(X(t^*) = b) = \frac{|a|}{|b| + |a|}, \quad P(X(t^*) = a) = \frac{|b|}{|b| + |a|}. \tag{13.3}$$

The second equation of (13.2) provides us with
$$a^2 P(X(t^*) = a) + b^2 P(X(t^*) = b) = Et^*(a, b).$$
Using (13.3) we find that
$$Et^*(a, b) = |a|\,|b|. \tag{13.4}$$
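
As a sanity check on the form of (13.3) and (13.4), one can look at a discrete analogue that is not part of the text's argument: the simple symmetric random walk started at 0 and stopped when it leaves an integer interval $(a, b)$. For that walk $S_n$ and $S_n^2 - n$ are martingales, so the same two equations are expected to hold. The sketch below (function names are illustrative) solves the first-step equations exactly with rational arithmetic.

```python
from fractions import Fraction


def solve_tridiagonal(n, rhs):
    """Solve -x[i-1]/2 + x[i] - x[i+1]/2 = rhs[i] exactly (Thomas algorithm).

    Neighbours outside 0..n-1 are treated as absent; their boundary values,
    if nonzero, must already be folded into rhs.
    """
    diag = [Fraction(1)] * n
    rhs = list(rhs)
    for i in range(1, n):                      # forward elimination
        factor = Fraction(-1, 2) / diag[i - 1]
        diag[i] = diag[i] - factor * Fraction(-1, 2)
        rhs[i] = rhs[i] - factor * rhs[i - 1]
    x = [Fraction(0)] * n
    x[n - 1] = rhs[n - 1] / diag[n - 1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = (rhs[i] + Fraction(1, 2) * x[i + 1]) / diag[i]
    return x


def exit_stats(a, b):
    """Return (P(exit at b), E(exit time)) for the walk started at 0, a < 0 < b."""
    states = list(range(a + 1, b))             # interior points
    n = len(states)
    # h(x) = P(exit at b | start x): h(x) = (h(x-1) + h(x+1))/2, h(a) = 0, h(b) = 1.
    rhs_h = [Fraction(0)] * n
    rhs_h[-1] = Fraction(1, 2)                 # contribution h(b)/2 of the barrier b
    h = solve_tridiagonal(n, rhs_h)
    # m(x) = E(exit time | start x): m(x) = 1 + (m(x-1) + m(x+1))/2, m(a) = m(b) = 0.
    m = solve_tridiagonal(n, [Fraction(1)] * n)
    origin = states.index(0)
    return h[origin], m[origin]


p_b, mean_time = exit_stats(-3, 5)
print(p_b, mean_time)   # prints 3/8 15, i.e. |a|/(|a|+|b|) and |a||b|
```

The agreement is exact here because the walk cannot overshoot an integer barrier, just as the continuous path cannot jump over $a$ or $b$.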


Rather than detour to prove the general martingale result, we defer proof 
until the next chapter and prove here only what we need. 


Proposition 13.5. For $t^* = t^*(a, b)$, $a < 0 < b$,
$$EX(t^*) = 0, \quad EX^2(t^*) = Et^*.$$


Proof. Let $t^*$ be a stopping time taking values in a countable set $\{t_j\} \subset [0, \tau]$,
$\tau < \infty$. Then
$$E(X(\tau) - X(t^*)) = \sum_j \int_{\{t^* = t_j\}} (X(\tau) - X(t_j))\, dP.$$
By independence, then, $E(X(\tau) - X(t^*)) = 0$; hence $EX(t^*) = 0$. For
the second equation, write
$$\begin{aligned}
EX(\tau)^2 &= E(X(\tau) - X(t^*) + X(t^*))^2 \\
&= E(X(\tau) - X(t^*))^2 + EX^2(t^*) + 2E(X(\tau) - X(t^*))(X(t^*)).
\end{aligned}$$

The last term is easily seen to be zero. The first term is
$$\sum_j \int_{\{t^* = t_j\}} (X(\tau) - X(t_j))^2\, dP$$
which equals
$$\sum_j (\tau - t_j) P(t^* = t_j) = \tau - Et^*.$$
Since $EX(\tau)^2 = \tau$,
$$EX^2(t^*) = Et^*.$$


Take $t^{**} = \min(t^*(a, b), \tau)$, and $t_n^*$ a sequence of stopping times taking
values in a countable subset of $[0, \tau]$ such that $t_n^* \to t^{**}$ everywhere. By
the bounded convergence theorem, $Et_n^* \to Et^{**}$. Furthermore, by path
continuity $X(t_n^*) \to X(t^{**})$, and
$$|X(t_n^*)|^2 \le \sup_{t \le \tau} |X(t)|^2,$$
which is integrable by 12.20. Use the bounded convergence theorem again
to get $EX(t_n^*) \to EX(t^{**})$, $EX^2(t_n^*) \to EX^2(t^{**})$. Write $t^*$ for $t^*(a, b)$, then
$$\int X(t^{**})\, dP = \int_{\{t^* \le \tau\}} X(t^*)\, dP + \int_{\{t^* > \tau\}} X(\tau)\, dP,$$
$$\int X^2(t^{**})\, dP = \int_{\{t^* \le \tau\}} X^2(t^*)\, dP + \int_{\{t^* > \tau\}} X^2(\tau)\, dP.$$
Note that $|X(t^*)| \le \max(|a|, |b|)$, and on $t^* > \tau$, $|X(\tau)| \le \max(|a|, |b|)$.
Hence as $\tau \to \infty$,
$$EX(t^{**}) \to EX(t^*), \quad EX^2(t^{**}) \to EX^2(t^*).$$

Since $t^{**} \uparrow t^*$, apply monotone convergence to get $\lim Et^{**} = Et^*$,
completing the proof.


Problem 1. Use Wald's identity 5.34 on the sums
$$S_n = \sum_{k=1}^{n} (X(k\,\Delta t) - X((k-1)\,\Delta t)),$$
and, letting $\Delta t \to 0$, prove that for $t^* = t^*(a, b)$, and any $\lambda$,
$$E \exp[\lambda X(t^*) - \tfrac{1}{2}\lambda^2 t^*] = 1.$$

By showing that differentiation under the integral is permissible, prove 13.5.
3. REPRESENTATION OF SUMS


An important representation is given by


Theorem 13.6. Given independent, identically distributed random variables
$Y_1, Y_2, \ldots$, $EY_1 = 0$, $EY_1^2 = \sigma^2 < \infty$, $S_n = Y_1 + \cdots + Y_n$, there exists a
probability space with a Brownian motion $X(t)$ and a sequence $T_1, T_2, \ldots$ of
nonnegative, independent, identically distributed random variables defined on it
such that the sequence $X(T_1), X(T_1 + T_2), \ldots$ has the same distribution as
$S_1, S_2, \ldots$, and $ET_1 = \sigma^2$.


Proof. Let $(U_n, V_n)$, $n = 1, 2, \ldots$ be a sequence of identically distributed,
independent random vectors defined on the same probability space as a
Brownian motion $\{X(t)\}$ such that $\mathcal{F}(X(t), t \ge 0)$ and $\mathcal{F}(U_n, V_n, n = 1, 2, \ldots)$
are independent. This can be done by constructing $(\Omega_1, \mathcal{F}_1, P_1)$ for the
Brownian motion, $(\Omega_2, \mathcal{F}_2, P_2)$ for the $(U_n, V_n)$ sequence, and taking
$$(\Omega, \mathcal{F}, P) = (\Omega_1 \times \Omega_2,\ \mathcal{F}_1 \times \mathcal{F}_2,\ P_1 \times P_2).$$
Suppose $U_n < 0 < V_n$. Define
$$T_1 = \inf\{t;\ X(t) \in (U_1, V_1)^c\}.$$

Therefore the $U_1$, $V_1$ function as random boundaries. Note that $T_1$ is a
random variable because
$$\{T_1 \le t\} = \Big\{\inf_{0 \le \tau \le t} X(\tau) \le U_1\Big\} \cup \Big\{\sup_{0 \le \tau \le t} X(\tau) \ge V_1\Big\}.$$

Further, $(X(t + \tau) - X(\tau), t \ge 0)$ is independent of $\mathcal{F}(X(s), s \le \tau, U_1, V_1)$.
By the same argument which was used to establish the strong Markov
property,
$$X^{(1)}(t) = X(t + T_1) - X(T_1), \quad t \ge 0,$$
is a Brownian motion and is independent of $X(T_1)$. Now define $T_2$ as the
first exit time of $X^{(1)}(t)$ from $(U_2, V_2)$. Then $X(T_1 + T_2) - X(T_1)$ has the
same distribution as $X(T_1)$. Repeating this procedure we manufacture
variables
$$X(T_1 + \cdots + T_n) - X(T_1 + \cdots + T_{n-1}),$$
which are independent and identically distributed. The trick is to select
$U_1$, $V_1$ so that $X(T_1)$ has the same distribution as $Y_1$. For any random
boundaries $U_1$, $V_1$, if $E|X(T_1)|^2 < \infty$, 13.5 gives
$$E(X(T_1) \mid U_1, V_1) = 0 \quad \text{a.s.},$$
$$E(X^2(T_1) \mid U_1, V_1) = E(T_1 \mid U_1, V_1) \quad \text{a.s.}$$
Hence
$$EX(T_1) = 0, \quad EX^2(T_1) = ET_1.$$
Therefore if $X(T_1)$ has the same distribution as $Y_1$, then $Y_1$ must satisfy
$$EY_1 = 0,$$
and automatically,
$$\sigma^2 = EY_1^2 = ET_1.$$

So $EY_1 = 0$ is certainly a necessary condition for the existence of random
boundaries such that $X(T_1)$ has the same distribution as $Y_1$.

To show that it is also sufficient, start with the observation that if $Y_1$
takes on only two values, say $u < 0$, and $v > 0$ with probabilities $p$ and $q$,
then from $EY_1 = 0$, we have $pu + qv = 0$. For this distribution, we can
take fixed boundaries $U_1 = u$, $V_1 = v$. Because, by 13.5, $EX(T_1) = 0$,
$uP(X(T_1) = u) + vP(X(T_1) = v) = 0$, which implies that $P(X(T_1) = u) = p$.
This idea can be extended to prove


Proposition 13.7. For any random variable $Y$ such that $EY = 0$, $EY^2 < \infty$,
there are random boundaries $U < 0 < V$ such that $X(T) \overset{\mathcal{D}}{=} Y$.


Proof. Assume first that the distribution of $Y$ is concentrated on points
$u_i < 0$, $v_i > 0$ with probability $p_i$, $q_i$ such that $(u_i, v_i)$ are pairs satisfying
$u_i p_i + v_i q_i = 0$. Then take
$$P((U, V) = (u_i, v_i)) = p_i + q_i.$$

By the observation above,
$$P(X(T) = u_i \mid (U, V) = (u_i, v_i)) = \frac{p_i}{p_i + q_i}.$$
Therefore $X(T)$ has the same distribution as $Y$. Suppose now that the
distribution of $Y$ is concentrated on a finite set $\{y_i\}$ of points. Then it is easy
to see that the pairs $u_i < 0$, $v_i > 0$ can be gotten such that $Y$ assumes only
values in $\{u_i\}$, $\{v_i\}$ and the pairs $(u_i, v_i)$ satisfy the conditions above. (Note
that $u_i$ may equal $u_j$, $i \ne j$, and similarly the $v_i$ are not necessarily distinct.)
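
One way to see that such pairs can always be found for a finite distribution is a greedy matching of negative against positive atoms. The sketch below is an illustration, not the text's construction: the function names are hypothetical, the distribution is assumed to have mean zero and no atom at 0, and exact rational arithmetic is used so that the balance condition $u_i p_i + v_i q_i = 0$ holds exactly.

```python
from fractions import Fraction


def balanced_pairs(dist):
    """Split a finite mean-zero law {value: probability} into tuples
    (u, v, p, q) with u < 0 < v and u*p + v*q == 0, using up all the mass."""
    neg = [[u, p] for u, p in sorted(dist.items()) if u < 0]
    pos = [[v, q] for v, q in sorted(dist.items()) if v > 0]
    pairs = []
    i = j = 0
    while i < len(neg) and j < len(pos):
        u, p_left = neg[i]
        v, q_left = pos[j]
        p = min(p_left, q_left * v / -u)    # mass taken from the atom at u
        q = p * -u / v                      # matching mass taken from the atom at v
        pairs.append((u, v, p, q))
        neg[i][1] -= p
        pos[j][1] -= q
        if neg[i][1] == 0:
            i += 1
        if pos[j][1] == 0:
            j += 1
    return pairs


def embedded_law(pairs):
    """Law of X(T) when (U, V) = (u, v) is chosen with probability p + q and,
    by (13.3), the exit occurs at u with probability |v| / (|u| + |v|)."""
    law = {}
    for u, v, p, q in pairs:
        weight = p + q
        exit_low = Fraction(v) / (v - u)
        law[u] = law.get(u, 0) + weight * exit_low
        law[v] = law.get(v, 0) + weight * (1 - exit_low)
    return law


dist = {-3: Fraction(1, 10), -1: Fraction(1, 2), 1: Fraction(1, 5), 3: Fraction(1, 5)}
pairs = balanced_pairs(dist)
print(pairs)
print(embedded_law(pairs) == dist)
```

The same value may appear in several pairs, which is the situation described in the parenthetical note above.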
For $Y$ having any distribution such that $EY = 0$, $EY^2 < \infty$, take $Y_n \xrightarrow{\mathcal{D}} Y$
where $Y_n$ takes values in a finite set of points, $EY_n = 0$. Define random
boundaries $U_n$, $V_n$ having stopping time $T_n$ such that $X(T_n) \overset{\mathcal{D}}{=} Y_n$. Suppose
that the random vectors $(U_n, V_n)$ have mass-preserving distributions. Then
take $(U_n, V_n) \xrightarrow{\mathcal{D}} (U, V)$. For these random boundaries and associated
stopping time $T$, for $I \subset (0, \infty)$, use (13.3) to get
$$P(X(T) \in I \mid U, V) = \frac{|U|}{|U| + |V|}\, \chi_I(V) = g(U, V).$$
Similarly,
$$P(X(T_n) \in I \mid U_n, V_n) = g(U_n, V_n).$$
Hence if $P(Y \in \mathrm{bd}(I)) = P(V \in \mathrm{bd}(I)) = 0$, then
$$\begin{aligned}
P(Y \in I) &= \lim P(Y_n \in I) \\
&= \lim Eg(U_n, V_n) \\
&= Eg(U, V) \\
&= P(X(T) \in I).
\end{aligned}$$
The analogous proof holds for $I \subset (-\infty, 0)$. To complete the proof we
need therefore to show that the $Y_n$ can be selected so that the $(U_n, V_n)$ have
a mass-preserving set of distributions. Take $F(dy)$ to denote the distribution
of $Y$. We can always select a nonempty finite interval $(a, b)$ such that,
including part or all of the mass of $F(dy)$ at the endpoints of $(a, b)$ in the integral,
we get
$$\int_{[a,b]} y\,F(dy) = 0.$$

Thus we can always take the distributions $F_n$ of the $Y_n$ such that
$$\int_{[a,b]} y\,F_n(dy) = 0.$$

In this case, the $(u_i, v_i)$ pairs have the property that either both are in $[a, b]$
or both are outside of $[a, b]$. Since $EY^2 < \infty$, we can also certainly take the
$Y_n$ such that $EY_n^2 \le M < \infty$ for all $n$. Write
$$EY_n^2 = E(E(X^2(T_n) \mid U_n, V_n)) = E(E(T_n \mid U_n, V_n)) = E|U_n V_n|.$$
But the function $|uv|$ goes to infinity as either $|u| \to \infty$ or $|v| \to \infty$ everywhere
in the region $\{u < a, v > b\}$. This does it.


Problem 2. Let the distribution function of $Y$ be $F(dy)$. Prove that the
random boundaries $(U, V)$ with distribution
$$P(U \in du) = F(du),$$
$$P(V \in dv \mid U = u) = \alpha\, |v|\, g(u, v)\, F(dv),$$
where $\alpha^{-1} = EY^+$ and $g(u, v)$ is zero or one as $u$ and $v$ have the same or
opposite signs, give rise to an exit time $T$ such that $X(T) \overset{\mathcal{D}}{=} Y$. (Here $U$ and
$V$ can be both positive and negative.)


4. CONVERGENCE OF SAMPLE PATHS 
OF SUMS TO BROWNIAN MOTION PATHS 


Now it is possible to show that in a very strong sense Brownian motion
is the limit of random walks with smaller and smaller steps, or of normed
sums of independent, identically distributed random variables. The random
walk example is particularly illuminating. Let the walk $X^{(n)}(t)$ take steps of
size $\pm 1/\sqrt{n}$ every $1/n$ time units. Using the method of the previous section, let
$T_1$ be the first time that $X(t)$ changes by an amount $1/\sqrt{n}$, then $T_2$ the time
until a second change of amount $1/\sqrt{n}$ occurs, etc.

The process
$$\tilde{X}^{(n)}(t) = X(T_1 + \cdots + T_{[nt]})$$
has the same distribution as the process $X^{(n)}(t)$. By definition,
$$T_1 + \cdots + T_{[nt]} = \text{time until } [nt] \text{ changes of magnitude } 1/\sqrt{n} \text{ have occurred along } X(t).$$

Therefore, up to time $T_1 + \cdots + T_{[nt]}$ the sum of the squares of the
changes in $X(t)$ is approximately $t$. But by 12.28, this takes a length of time $t$.
So, we would expect that $T_1 + \cdots + T_{[nt]} \to t$, hence that each sample
path of the interpolated motion $\tilde{X}^{(n)}(t)$ would converge as a function of $t$
to the corresponding path of Brownian motion. The convergence that does
take place is uniform convergence. This holds, in general, along subsequences.


Theorem 13.8. Let $Y_1, Y_2, \ldots$ be independent, identically distributed random
variables, $EY_1 = 0$, $EY_1^2 = \sigma^2 < \infty$, $S_n = Y_1 + \cdots + Y_n$. Define the
processes $X^{(n)}(t)$ by $S_{[nt]}/\sigma\sqrt{n}$. Then there are processes $\{\tilde{X}^{(n)}(t)\}$, for each $n$
having the same distribution as $\{X^{(n)}(t)\}$, defined on a common probability space
and a Brownian motion process $\{X(t)\}$ on the same space, such that for any
subsequence $\{n_k\}$ increasing rapidly enough,
$$\sup_{0 \le t \le 1} |\tilde{X}^{(n_k)}(t) - X(t)| \to 0 \quad \text{a.s.}$$


Proof. Assume that $EY_1^2 = 1$. Let $(\Omega, \mathcal{F}, P)$ be constructed as in the
representation theorem. For each $n$, consider the Brownian motion $X_n(t) =
\sqrt{n}\,X(t/n)$. Construct $T_1^{(n)}, T_2^{(n)}, \ldots$ using the motion $X_n(t)$. Then the $\{S_k\}$
sequence has the same distribution as the $\{X_n(T_1^{(n)} + \cdots + T_k^{(n)})\}$ sequence.
Thus the $X^{(n)}(t)$ process has the same distribution as
$$\tilde{X}^{(n)}(t) = X\left(\frac{T_1^{(n)} + \cdots + T_{[nt]}^{(n)}}{n}\right). \tag{13.9}$$

The sequence $T_1^{(n)}, T_2^{(n)}, \ldots$ for each $n$ consists of independent, identically
distributed random variables such that $ET_1^{(n)} = 1$, and $T_1^{(n)}$ has the same
distribution for all $n$. The weak law of large numbers gives
$$\frac{T_1^{(n)} + \cdots + T_{[nt]}^{(n)}}{[nt]} \xrightarrow{P} 1,$$
so for $n$ large, the random time appearing in (13.9) should be nearly $t$.




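Argue that if it can be shown that
$$W_n = \sup_{0 \le t \le 1} \left| \frac{T_1^{(n)} + \cdots + T_{[nt]}^{(n)}}{n} - t \right| \to 0 \quad \text{a.s.}$$
for $n$ running through some subsequence, then the continuity of $X(t)$
guarantees that along the same subsequence
$$\sup_{0 \le t \le 1} |\tilde{X}^{(n)}(t) - X(t)| \to 0 \quad \text{a.s.}$$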


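What can be easily proved is that $W_n \xrightarrow{P} 0$. But this is enough because
then for any subsequence increasing rapidly enough, $W_{n_k} \xrightarrow{\text{a.s.}} 0$. Use
$\bar{T}_k^{(n)} = T_k^{(n)} - 1$, so $E\bar{T}_k^{(n)} = 0$,
$$W_n = \sup_{0 \le t \le 1} \left| \frac{\bar{T}_1^{(n)} + \cdots + \bar{T}_{[nt]}^{(n)}}{n} + \frac{[nt] - nt}{n} \right|.$$

Ignore the second term, and write
$$W_n \le \sup_{0 \le t \le 1} t \left| \frac{\bar{T}_1^{(n)} + \cdots + \bar{T}_{[nt]}^{(n)}}{[nt]} \right|.$$
For any $\epsilon$, $0 < \epsilon < 1$, take
$$W_n' = \epsilon \sup_{0 \le t \le \epsilon} \left| \frac{\bar{T}_1^{(n)} + \cdots + \bar{T}_{[nt]}^{(n)}}{[nt]} \right| \le \epsilon \sup_{k \ge 1} \left| \frac{\bar{T}_1^{(n)} + \cdots + \bar{T}_k^{(n)}}{k} \right| = \epsilon M^{(n)}.$$

The distribution of $M^{(n)}$ is the same as $M^{(1)}$. Now write
$$W_n'' = \sup_{\epsilon \le t \le 1} \left| \frac{\bar{T}_1^{(n)} + \cdots + \bar{T}_{[nt]}^{(n)}}{[nt]} \right| \le \sup_{k \ge [\epsilon n]} \left| \frac{\bar{T}_1^{(n)} + \cdots + \bar{T}_k^{(n)}}{k} \right|.$$

This bounding term has the same distribution as
$$\sup_{k \ge [\epsilon n]} \left| \frac{\bar{T}_1^{(1)} + \cdots + \bar{T}_k^{(1)}}{k} \right|.$$
The law of large numbers implies
$$\frac{\bar{T}_1^{(1)} + \cdots + \bar{T}_k^{(1)}}{k} \xrightarrow{\text{a.s.}} 0,$$
so that $W_n'' \xrightarrow{P} 0$. Since $W_n \le W_n' + W_n''$, then for any $\epsilon > 0$, $x > 0$,
$$\limsup_n P(W_n > x) \le P(\epsilon M^{(1)} > x).$$

Taking $\epsilon \downarrow 0$ now gives the result.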


5. AN INVARIANCE PRINCIPLE 
The question raised in the introduction to this chapter is generally this:
The sequence of processes $X^{(n)}(\cdot)$ converges in distribution to Brownian motion
$X(\cdot)$, denoted $X^{(n)}(\cdot) \xrightarrow{\mathcal{D}} X(\cdot)$, in the sense that for any $0 \le t_1 < \cdots < t_k \le 1$,
$$(X^{(n)}(t_1), \ldots, X^{(n)}(t_k)) \xrightarrow{\mathcal{D}} (X(t_1), \ldots, X(t_k)).$$
Let $H(x(\cdot))$ be defined on $R^{[0,1]}$. When is it true that
$$H(X^{(n)}(\cdot, \omega)) \xrightarrow{\mathcal{D}} H(X(\cdot, \omega))?$$

There is no obvious handle. What is clear, however, is that we can proceed
as follows: Let us suppose $X^{(n)}(\cdot)$ and $X(\cdot)$ are defined on the same space and
take values in some subset $D \subset R^{[0,1]}$. On $D$ define the sup-norm metric
$$\rho(x(\cdot), y(\cdot)) = \sup_{0 \le t \le 1} |x(t) - y(t)|,$$
and assume that $H(x(\cdot))$ defined on $D$ is continuous with respect to $\rho$. Then
if the sample paths of the $X^{(n)}(\cdot)$ processes converge uniformly to the
corresponding paths of $X(\cdot)$, that is, if
$$\sup_{0 \le t \le 1} |X^{(n)}(t) - X(t)| \to 0 \quad \text{a.s.},$$
then
$$H(X^{(n)}(\cdot)) \xrightarrow{\text{a.s.}} H(X(\cdot)).$$

But this is enough to give us what we want. Starting with the $X^{(n)}(t) =
S_{[nt]}/\sigma\sqrt{n}$, we can construct $\tilde{X}^{(n)}(\cdot)$ having the same distribution as $X^{(n)}(\cdot)$
so that the $\tilde{X}^{(n)}(t) \to X(t)$ uniformly for $t \in [0, 1]$ for $n$ running through
subsequences. Thus, $H(\tilde{X}^{(n)}(\cdot)) \xrightarrow{\text{a.s.}} H(X(\cdot))$, $n \in \{n_k\}$. This implies
$$H(X^{(n)}(\cdot)) \xrightarrow{\mathcal{D}} H(X(\cdot)), \quad n \in \{n_k\}.$$

But this holding true for every subsequence $\{n_k\}$ increasing rapidly enough
implies that the full sequence converges in distribution to $H(X(\cdot))$. Now to
fasten down this idea.




Definition 13.10. $D$ is the class of all functions $x(t)$, $0 \le t \le 1$, such that
$x(t-)$, $x(t+)$ exist for all $t \in (0, 1)$, and $x(t+) = x(t)$. Also, $x(0+) = x(0)$,
$x(1-) = x(1)$. Define $\rho(x(\cdot), y(\cdot))$ on $D$ by
$$\sup_{0 \le t \le 1} |x(t) - y(t)|.$$


Definition 13.11. For $H(x(\cdot))$ defined on $D$, let $G$ be the set of all functions
$x(\cdot) \in D$ such that $H$ is discontinuous at $x(\cdot)$ in the metric $\rho$. If there is a
set $G_1 \in \mathcal{B}^{[0,1]}$ such that $G \subset G_1$ and for a normalized Brownian motion
$\{X(t)\}$, $P(X(\cdot) \in G_1) = 0$, call $H$ a.s. B-continuous.

The weakening of the continuity condition on $H$ to a.s. B-continuity is
important. For example, the $H$ that leads to the arc sine law is discontinuous
at the set of all $x(\cdot) \in D$ such that
$$l\{t;\ x(t) = 0,\ 0 \le t \le 1\} > 0.$$

(We leave this to the reader to prove as Problem 4.) But this set has
probability zero in Brownian motion.

With these definitions, we can state the following special case of the 
“invariance principle.” 


Theorem 13.12. Let $H$ defined on $D$ be a.s. B-continuous. Consider any
process of the type
$$X^{(n)}(t) = \frac{S_{[nt]}}{\sigma\sqrt{n}},$$
where the $S_n$ are sums of independent, identically distributed random variables
$Y_1, Y_2, \ldots$ with $EY_1 = 0$, $EY_1^2 = \sigma^2$. Assume that the $H(X^{(n)}(\cdot))$ are random
variables. Then
$$H(X^{(n)}(\cdot)) \xrightarrow{\mathcal{D}} H(X(\cdot)), \tag{13.13}$$
where $\{X(t)\}$ is normalized Brownian motion.


Proof. Use 8.8; it is enough to show that any subsequence $\{n_k\}$ contains a
subsequence $\{n_j\}$ such that (13.13) holds along $n_j$. Construct $\tilde{X}^{(n)}(t)$ as in the
proof of 13.8. Take $n_j$ any subsequence of $n_k$ increasing rapidly enough.
Then $\tilde{X}^{(n_j)}(t)$ converges uniformly to $X(t)$ for almost every $\omega$, implying that
(13.13) holds along the $n_j$ sequence.

There is a loose end in that $H(X(\cdot))$ was not assumed to be a random
variable. However, since
$$H(\tilde{X}^{(n_j)}(\cdot)) \xrightarrow{\text{a.s.}} H(X(\cdot)),$$
the latter is a.s. equal to a random variable. Hence it is a random variable
with respect to the completed probability space $(\Omega, \bar{\mathcal{F}}, P)$, and its distribution
is well defined.


The reason that theorems of this type are referred to as invariance
principles is that they establish convergence to a limiting distribution which
does not depend on the distribution function of the independent summands
$Y_1, Y_2, \ldots$ except for the one parameter $\sigma^2$. This gives the freedom to choose
the most convenient way to evaluate the limit distribution. Usually, this is
done either directly for Brownian motion or by combinatorial arguments
for coin-tossing variables $Y_1, Y_2, \ldots$ In particular, see Feller's book [59,
Vol. I], for a combinatorial proof that in fair coin-tossing, the proportion of
times $W_n$ that the player is ahead in the first $n$ tosses has the limit distribution
$$\lim_n P(W_n < x) = \frac{2}{\pi} \arcsin \sqrt{x}.$$
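
The arc sine limit is easy to look at empirically. The following is a rough Monte Carlo sketch, not a proof and not taken from the text: it simulates fair coin-tossing games, computes $W_n$ with the definition $S_k > 0$ used in the introduction, and prints the empirical distribution function next to $(2/\pi) \arcsin \sqrt{x}$. The game length, number of repetitions, and seed are arbitrary choices, and at finite $n$ a small discrepancy is to be expected.

```python
import math
import random


def fraction_ahead(n, rng):
    """W_n = (number of k <= n with S_k > 0) / n for one fair coin-tossing game."""
    s = 0
    ahead = 0
    for _ in range(n):
        s += 1 if rng.random() < 0.5 else -1
        if s > 0:
            ahead += 1
    return ahead / n


def arcsine_cdf(x):
    """Limit distribution function (2 / pi) * arcsin(sqrt(x)) on [0, 1]."""
    return (2.0 / math.pi) * math.asin(math.sqrt(x))


def empirical_cdf(samples, x):
    """Fraction of the samples that are <= x."""
    return sum(1 for w in samples if w <= x) / len(samples)


rng = random.Random(0)          # fixed seed, so the run is reproducible
samples = [fraction_ahead(1000, rng) for _ in range(2000)]
for x in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(x, round(empirical_cdf(samples, x), 3), round(arcsine_cdf(x), 3))
```

Replacing the coin tosses by any other zero-mean, finite-variance steps should leave the printed comparison essentially unchanged, which is the content of the invariance principle.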


Problems
3. Show that the function on $D$ defined by
$$H(x(\cdot)) = \sup_{0 \le t \le 1} |x(t)|$$
is continuous everywhere.
4. Show that the function on $D$ defined by
$$H(x(\cdot)) = l\{t;\ x(t) > 0,\ 0 \le t \le 1\}$$
is continuous at $x(\cdot)$ if and only if $l\{t;\ x(t) = 0\} = 0$.


6. THE KOLMOGOROV-SMIRNOV STATISTICS 


An important application of invariance is to an estimation problem. Let
$Y_1, Y_2, \ldots$ be independent, identically distributed random variables with
a continuous but unknown distribution function $F(x)$. The most obvious
way to estimate $F(x)$ given $n$ observations $Y_1, \ldots, Y_n$ is to put
$$\hat{F}_n(x) = \frac{1}{n}\{\text{number of } Y_k < x,\ k = 1, \ldots, n\} = \frac{1}{n} \sum_{k=1}^{n} \chi_{(-\infty, x)}(Y_k).$$
The law of large numbers guarantees
$$\hat{F}_n(x) \xrightarrow{\text{a.s.}} F(x)$$
for fixed $x$. From the central limit theorem,
$$\sqrt{n}\,(\hat{F}_n(x) - F(x)) \xrightarrow{\mathcal{D}} \mathcal{N}(0, F(x)(1 - F(x))).$$
However, we will be more interested in uniform estimates:
$$D_n^+ = \sqrt{n} \sup_x (\hat{F}_n(x) - F(x)),$$
$$D_n^- = \sqrt{n} \inf_x (\hat{F}_n(x) - F(x)),$$
$$D_n = \sqrt{n} \sup_x |\hat{F}_n(x) - F(x)|,$$
and the problem is to show that $D_n^+$, $D_n^-$, $D_n$ converge in distribution, and
to find the limiting distribution.


Proposition 13.14. Each of $D_n^+$, $D_n^-$, $D_n$ has the same distribution for all
continuous $F(x)$.


Proof. Call $I = [a, b]$ an interval of constancy for $F(x)$ if $P(Y_1 \in I) = 0$
and there is no larger interval containing $I$ having this property. Let $B$ be
the union of all the intervals of constancy. Clearly, we can write
$$D_n = \sqrt{n} \sup_{x \in B^c} |\hat{F}_n(x) - F(x)|,$$
and similar equations for $D_n^+$ and $D_n^-$. For $x \in B^c$, the sets
$$\{Y_k < x\}, \quad \{F(Y_k) < F(x)\}$$
are identical. Put $U_k = F(Y_k)$, and set
$$\hat{G}_n(y) = \frac{1}{n}\{\text{number of } U_k < y,\ k = 1, \ldots, n\}.$$
Then
$$D_n = \sqrt{n} \sup_{x \in B^c} |\hat{G}_n(F(x)) - F(x)| = \sqrt{n} \sup_{x \in R^{(1)}} |\hat{G}_n(F(x)) - F(x)|.$$

Since $F(x)$ maps $R^{(1)}$ onto $(0, 1)$ plus the points $\{0\}$ or $\{1\}$ possibly,
$$D_n = \sqrt{n} \sup_{y \in (0,1)} |\hat{G}_n(y) - y| = \sqrt{n} \sup_{y \in [0,1]} |\hat{G}_n(y) - y| \quad \text{a.s.},$$
the latter holding because $P(U_1 = 0) = P(U_1 = 1) = 0$. The distribution
of $U_1$ is given by
$$P(U_1 < y) = P(F(Y_1) < y).$$

Put $x = \inf\{\xi;\ F(\xi) = y\}$. Then
$$P(U_1 < y) = P(F(Y_1) < F(x)) = P(Y_1 < x) = F(x) = y.$$
Thus $U_1$ is uniformly distributed on $[0, 1]$, and $D_n$ for arbitrary continuous
$F$ has the same distribution as $D_n$ for the uniform distribution. Similarly for
$D_n^-$ and $D_n^+$.
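
The mechanism behind this proposition, that the statistic depends on the sample only through the values $U_k = F(Y_k)$, can be seen numerically. In the sketch below (an illustration with hypothetical helper names), one set of uniform variables is turned into an exponential sample and a logistic sample by inverse transformation; the statistic $D_n$ computed against each true distribution function agrees with the one computed from the uniforms up to floating-point rounding.

```python
import math
import random


def ks_statistic(sample, cdf):
    """sqrt(n) * sup_x |F_n(x) - F(x)| for a continuous cdf F.

    The supremum is evaluated at the order statistics, where the empirical
    distribution function jumps from (k - 1)/n to k/n.
    """
    n = len(sample)
    u = sorted(cdf(y) for y in sample)
    d = max(max((k + 1) / n - u[k], u[k] - k / n) for k in range(n))
    return math.sqrt(n) * d


def open_uniform(rng):
    """Uniform draw from (0, 1), excluding 0 so that logarithms are defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


rng = random.Random(1)
uniforms = [open_uniform(rng) for _ in range(200)]

# Inverse-transform samples built from the same uniforms.
exp_sample = [-math.log(1.0 - u) for u in uniforms]            # F(x) = 1 - exp(-x)
logistic_sample = [math.log(u / (1.0 - u)) for u in uniforms]  # F(x) = 1/(1 + exp(-x))

d_uniform = ks_statistic(uniforms, lambda y: y)
d_exp = ks_statistic(exp_sample, lambda x: 1.0 - math.exp(-x))
d_logistic = ks_statistic(logistic_sample, lambda x: 1.0 / (1.0 + math.exp(-x)))
print(d_uniform, d_exp, d_logistic)
```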

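Let $U_1, \ldots, U_n$ be independent random variables uniformly distributed
on $[0, 1]$. The order statistics are defined as follows: $U_1^{(n)}$ is the smallest,
and so forth; $U_n^{(n)}$ is the largest. The maximum of $|\hat{G}_n(y) - y|$ or of
$\hat{G}_n(y) - y$ or $y - \hat{G}_n(y)$ must occur at one of the jumps of $\hat{G}_n(y)$. The
jumps are at the points $U_k^{(n)}$, and
$$\hat{G}_n(U_k^{(n)}) = \frac{k - 1}{n}.$$
Since the size of the jumps is $1/n$, then to within $1/\sqrt{n}$,
$$D_n = \sqrt{n} \max_{k \le n} \left| \frac{k}{n} - U_k^{(n)} \right|,$$
$$D_n^+ = \sqrt{n} \max_{k \le n} \left( \frac{k}{n} - U_k^{(n)} \right),$$
$$D_n^- = \sqrt{n} \min_{k \le n} \left( \frac{k}{n} - U_k^{(n)} \right).$$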


The fact that makes our invariance theorem applicable is that the
$U_1^{(n)}, \ldots, U_n^{(n)}$ behave something like sums of independent random
variables. Let $W_1, W_2, \ldots$ be independent random variables with the negative
exponential distribution. That is,
$$P(W_k > x) = e^{-x}, \quad x \ge 0.$$
Denote $Z_k = W_1 + \cdots + W_k$; then


Proposition 13.15. $U_k^{(n)}$, $k = 1, \ldots, n$ have the same joint distribution as
$Z_k/Z_{n+1}$, $k = 1, \ldots, n$.


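Proof. To show this, write (using a little symbolic freedom),
$$\begin{aligned}
P(Z_1 \in dx_1, Z_2 \in dx_2, \ldots, Z_{n+1} \in dx_{n+1})
&= P(W_1 \in dx_1, W_2 \in dx_2 - x_1, \ldots, W_{n+1} \in dx_{n+1} - x_n) \\
&= e^{-x_1} e^{-(x_2 - x_1)} \cdots e^{-(x_{n+1} - x_n)}\, dx_1 \cdots dx_{n+1} \\
&= e^{-x_{n+1}}\, dx_1 \cdots dx_{n+1}, \quad 0 \le x_1 \le x_2 \le \cdots \le x_{n+1}.
\end{aligned}$$
Thus
$$P(Z_1 \in dx_1, \ldots, Z_n \in dx_n \mid Z_{n+1} = x_{n+1}) = \begin{cases} n!\, x_{n+1}^{-n}\, dx_1 \cdots dx_n, & 0 \le x_1 \le x_2 \le \cdots \le x_n \le x_{n+1}, \\ 0, & \text{otherwise.} \end{cases}$$
From this,
$$P\left(\frac{Z_1}{Z_{n+1}} \in dy_1, \ldots, \frac{Z_n}{Z_{n+1}} \in dy_n \,\Big|\, Z_{n+1} = x_{n+1}\right) = \begin{cases} n!\, dy_1 \cdots dy_n, & 0 \le y_1 \le \cdots \le y_n \le 1, \\ 0, & \text{otherwise.} \end{cases}$$

Therefore,
$$P\left(\frac{Z_1}{Z_{n+1}} \in dy_1, \ldots, \frac{Z_n}{Z_{n+1}} \in dy_n\right) = \begin{cases} n!\, dy_1 \cdots dy_n, & 0 \le y_1 \le \cdots \le y_n \le 1, \\ 0, & \text{otherwise.} \end{cases}$$
On the other hand, for $0 \le y_1 \le \cdots \le y_n \le 1$,
$$P(U_1^{(n)} \in dy_1, \ldots, U_n^{(n)} \in dy_n) = \sum P(U_{i_1} \in dy_1, \ldots, U_{i_n} \in dy_n),$$
where the sum is over all permutations $(i_1, \ldots, i_n)$ of $(1, \ldots, n)$. Using
independence this yields $n!\, dy_1 \cdots dy_n$.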
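
A quick simulation makes the proposition plausible. The sketch below is only a partial check with arbitrary sample sizes: it compares the means of the uniform order statistics with the means of the ratios $Z_k/Z_{n+1}$, both of which should be close to $k/(n+1)$ (the standard mean of the $k$th uniform order statistic). Matching means is of course much weaker than equality of the joint distributions proved above.

```python
import random


def uniform_order_statistics(n, rng):
    """Sorted sample of n independent uniform(0, 1) variables."""
    return sorted(rng.random() for _ in range(n))


def exponential_ratios(n, rng):
    """Z_k / Z_{n+1}, k = 1..n, with Z_k partial sums of unit exponentials."""
    partial = []
    total = 0.0
    for _ in range(n + 1):
        total += rng.expovariate(1.0)
        partial.append(total)
    return [z / partial[-1] for z in partial[:n]]


def column_means(rows):
    """Coordinate-wise averages of a list of equal-length lists."""
    n = len(rows[0])
    return [sum(row[k] for row in rows) / len(rows) for k in range(n)]


rng = random.Random(2)
n, reps = 4, 20000
mean_u = column_means([uniform_order_statistics(n, rng) for _ in range(reps)])
mean_z = column_means([exponential_ratios(n, rng) for _ in range(reps)])
for k in range(n):
    print(k + 1, round(mean_u[k], 3), round(mean_z[k], 3), (k + 1) / (n + 1))
```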


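Use this proposition to transform the previous expression for $D_n$ into
$$D_n = \sqrt{n} \max_{k \le n} \left| \frac{Z_k}{Z_{n+1}} - \frac{k}{n} \right|,$$
with analogous expressions for $D_n^+$, $D_n^-$. Then
$$D_n = \frac{n}{Z_{n+1}} \max_{k \le n} \left| \frac{Z_k - k}{\sqrt{n}} - \frac{k}{n} \cdot \frac{Z_{n+1} - n}{\sqrt{n}} \right|.$$

Because $EW_1 = 1$, $\sigma^2(W_1) = 1$, it follows that $n/Z_{n+1} \xrightarrow{\text{a.s.}} 1$, and that
$Z_k - k$ is a sum of independent, identically distributed random variables
with first moment zero and second moment one. Put $S_k = Z_k - k$,
$$X^{(n)}(t) = \frac{S_{[nt]}}{\sqrt{n}},$$
and ignore the $n/Z_{n+1}$ term and terms of order $1/\sqrt{n}$. Then $D_n$ is given by
$$\sup_{0 \le t \le 1} |X^{(n)}(t) - tX^{(n)}(1)|.$$
Obviously,
$$\sup_{0 \le t \le 1} |x(t) - tx(1)|$$
is a continuous function in the sup-norm metric, so now applying the
invariance principle, we have proved
Theorem 13.16
$$D_n \xrightarrow{\mathcal{D}} \sup_{0 \le t \le 1} |X(t) - tX(1)|,$$
$$D_n^+ \xrightarrow{\mathcal{D}} \sup_{0 \le t \le 1} (X(t) - tX(1)),$$
$$D_n^- \xrightarrow{\mathcal{D}} \inf_{0 \le t \le 1} (X(t) - tX(1)).$$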


7. MORE ON FIRST-EXIT DISTRIBUTIONS 


There is a wealth of material in the literature on evaluating the distributions 
of functions on Brownian motion. One method uses some transformations 
that carry Brownian motion into Brownian motion. A partial list of such 
transformations is 


Proposition 13.17. If $X(t)$ is normalized Brownian motion, then so is

1) $-X(t)$, $t \ge 0$ (symmetry),

2) $X(t + \tau) - X(\tau)$, $t \ge 0$, $\tau \ge 0$ fixed (origin change),

3) $tX(1/t)$, $t > 0$ (inversion),

4) $(1/\sqrt{\alpha})X(\alpha t)$, $t \ge 0$, $\alpha > 0$ (scale change),

5) $X(T) - X(T - t)$, $0 \le t \le T$ fixed (reversal).
To get (4) and (5) just check that the processes are Gaussian with zero means
and the right covariance.
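
The covariance checks are one-line computations from $EX(s)X(t) = \min(s, t)$. As a sketch, here they are for the inversion, the scale change, and the reversal, taking $s \le t$ throughout:

```latex
\begin{aligned}
E\big[tX(1/t)\, sX(1/s)\big] &= ts \min(1/t,\, 1/s) = ts \cdot \tfrac{1}{t} = s, \\
E\big[\tfrac{1}{\sqrt{\alpha}}X(\alpha t)\, \tfrac{1}{\sqrt{\alpha}}X(\alpha s)\big]
  &= \tfrac{1}{\alpha} \min(\alpha t,\, \alpha s) = s, \\
E\big[(X(T) - X(T - t))(X(T) - X(T - s))\big]
  &= T - (T - t) - (T - s) + (T - t) = s.
\end{aligned}
```

In each case the result is $\min(s, t)$, the covariance of normalized Brownian motion.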

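We apply these transformations and the strong Markov property to get
the distributions of some first exit times and probabilities. These are related
to a number of important functions on Brownian motion. For example, for
$x > 0$, if $t_x^*$ is the first hitting time of the point $\{x\}$, then
$$P\Big(\sup_{0 \le \tau \le t} X(\tau) < x\Big) = P(t_x^* > t). \tag{13.18}$$
To get the distribution of $t_x^*$, let
$$\varphi(\lambda, x) = Ee^{-\lambda t_x^*}, \quad \lambda > 0.$$

Take $x, y > 0$, note that $t_{x+y}^* = t_x^* + \tau^*$, where $\tau^*$ is the first passage time
of the process $X(t_x^* + t) - X(t_x^*)$ to the point $y$. By the strong
Markov property, $\tau^*$ has the same distribution as $t_y^*$ and is independent of
$t_x^*$. Thus,
$$\varphi(\lambda, x + y) = \varphi(\lambda, x)\,\varphi(\lambda, y). \tag{13.19}$$
Since $\varphi(\lambda, x)$ is decreasing in $x$, and therefore well-behaved, 13.19 implies
$$\varphi(\lambda, x) = e^{-\psi(\lambda)x}. \tag{13.20}$$
Now we can get more information by a scale change. Transformation
13.17(4) implies that a scale change in space by an amount $\sqrt{\alpha}$ changes time
by a factor $\alpha$. To be exact,
$$\begin{aligned}
t_x^* &= \inf\{t;\ X(t) = x\} \\
&= \alpha \inf\{t/\alpha;\ (1/\sqrt{\alpha})X(\alpha(t/\alpha)) = x/\sqrt{\alpha}\} \\
&\overset{\mathcal{D}}{=} \alpha \inf\{t';\ X(t') = x/\sqrt{\alpha}\}.
\end{aligned}$$
Therefore $t_x^*$ has the same distribution as $\alpha\, t_{x/\sqrt{\alpha}}^*$, yielding
$$\begin{aligned}
\varphi(\lambda, x) &= \varphi(\alpha\lambda,\ x/\sqrt{\alpha}), \\
\Rightarrow\ \psi(\lambda) &= c\sqrt{\lambda}, \\
\Rightarrow\ \varphi(\lambda, x) &= e^{-c\sqrt{\lambda}\,x}, \quad x > 0, \\
\Rightarrow\ \varphi(\lambda, x) &= e^{-c\sqrt{\lambda}\,|x|}, \quad \text{all } x, \text{ by symmetry.}
\end{aligned} \tag{13.21}$$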
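
The step from the scaling identity to the square-root form of the exponent is short but worth writing out. A sketch, using (13.20) on both sides of the first line of (13.21):

```latex
\begin{aligned}
e^{-\psi(\lambda)x} &= e^{-\psi(\alpha\lambda)\,x/\sqrt{\alpha}}
  \quad \text{for all } x > 0,\ \alpha > 0,\ \lambda > 0, \\
\Rightarrow\ \psi(\alpha\lambda) &= \sqrt{\alpha}\,\psi(\lambda), \\
\Rightarrow\ \psi(\alpha) &= \psi(1)\sqrt{\alpha} \quad (\text{put } \lambda = 1),
\end{aligned}
```

so the exponent is $c\sqrt{\lambda}$ with $c = \psi(1)$.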


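Now $\varphi(\lambda, x)$ uniquely determines the distribution of $t_x^*$, so if we can get $c$,
then we are finished. Unfortunately, there seems to be no very simple way
to get $c$. Problem 5 outlines one method of showing that $c = \sqrt{2}$. Accept
this for now, because arguments of this sort can get us the distribution of
$$t^*(a, b) = \min(t_a^*, t_b^*), \quad \{0\} \in (a, b).$$
Denote
$$t^* = \min(t_a^*, t_b^*), \quad A_a = \{t_a^* < t_b^*\}, \quad A_b = \{t_b^* < t_a^*\},$$
so that on $A_a$, $t_a^* = t^*$, and on $A_b$, $t_a^* = t^* + \tau^*$, where $\tau^*$ is the additional
time needed to get to $x = a$ once the process has hit $x = b$. So define
$$\tau^* = \min\{t;\ X(t^* + t) - X(t^*) = a - b\}.$$
Put these together:
$$Ee^{-\lambda t_a^*} = \int_{A_a} e^{-\lambda t^*}\, dP + \int_{A_b} e^{-\lambda(t^* + \tau^*)}\, dP.$$

Now check that $A_b \in \mathcal{F}(X(t), t \le t^*)$. Since the variable $\tau^*$ is independent
of $\mathcal{F}(X(t), t \le t^*)$,
$$e^{-\sqrt{2\lambda}\,|a|} = \int_{A_a} e^{-\lambda t^*}\, dP + e^{-\sqrt{2\lambda}\,(|b| + |a|)} \int_{A_b} e^{-\lambda t^*}\, dP.$$
The same argument for $t_b^*$ gives
$$e^{-\sqrt{2\lambda}\,|b|} = e^{-\sqrt{2\lambda}\,(|b| + |a|)} \int_{A_a} e^{-\lambda t^*}\, dP + \int_{A_b} e^{-\lambda t^*}\, dP. \tag{13.22}$$
Now solve, to get
$$\begin{aligned}
1)\quad \int_{A_a} e^{-\lambda t^*}\, dP &= \frac{\sinh(\sqrt{2\lambda}\,|b|)}{\sinh(\sqrt{2\lambda}\,(|b| + |a|))}, \\
2)\quad \int_{A_b} e^{-\lambda t^*}\, dP &= \frac{\sinh(\sqrt{2\lambda}\,|a|)}{\sinh(\sqrt{2\lambda}\,(|b| + |a|))}.
\end{aligned} \tag{13.23}$$

The sum of (13.23(1) and (2)) is $Ee^{-\lambda t^*}$, the Laplace transform of the
distribution of $t^*$. By inverting this, we can get
$$P(a < X(\tau) < b, \text{ all } \tau \in [0, t]).$$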
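
The two-sided formulas can be checked numerically against what is already known. The sketch below (function names are illustrative) evaluates the hyperbolic-sine expressions of (13.23), verifies that they satisfy the pair of linear equations they were solved from, and confirms that as $\lambda \to 0$ they reduce to the exit probabilities (13.3) and that the slope of the Laplace transform at $\lambda = 0$ gives $Et^* = |a|\,|b|$ as in (13.4).

```python
import math


def exit_transforms(lam, a, b):
    """Return the integrals of exp(-lam * t*) over A_a and A_b from (13.23)."""
    s = math.sqrt(2.0 * lam)
    width = abs(a) + abs(b)
    over_a = math.sinh(s * abs(b)) / math.sinh(s * width)
    over_b = math.sinh(s * abs(a)) / math.sinh(s * width)
    return over_a, over_b


def residuals(lam, a, b):
    """Left side minus right side of the two linear equations ending in (13.22)."""
    s = math.sqrt(2.0 * lam)
    width = abs(a) + abs(b)
    over_a, over_b = exit_transforms(lam, a, b)
    r1 = over_a + math.exp(-s * width) * over_b - math.exp(-s * abs(a))
    r2 = math.exp(-s * width) * over_a + over_b - math.exp(-s * abs(b))
    return r1, r2


a, b = -1.0, 2.0
print(residuals(0.7, a, b))                 # both entries should be ~0

small = 1e-6
over_a, over_b = exit_transforms(small, a, b)
print(over_b, abs(a) / (abs(a) + abs(b)))   # exit probability at b, from (13.3)
mean_exit_time = (1.0 - (over_a + over_b)) / small
print(mean_exit_time, abs(a) * abs(b))      # approximately E t*, from (13.4)
```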


Very similar methods can be used to compute the probability that the
Brownian motion ever hits the line $x = at + b$, $a > 0$, $b > 0$, or equivalently,
exits from the open region with the variable boundary $x = at + b$.
Let $p(a, b)$ be the probability that $X(t)$ ever touches the line $at + b$, $a > 0$,
$b > 0$. Then
$$p(a, b_1 + b_2) = p(a, b_1)\,p(a, b_2),$$
the argument being that to get to $at + b_1 + b_2$, first the particle must get
to $at + b_1$, but once it does, it then has to get to a line whose equation
relative to its present position is $at + b_2$. To define this more rigorously,
let $\tau^*$ be the time of first touching $at + b_1$; then $t^* = \min(\tau^*, s)$ is a stopping
time. The probability that the process ever touches the line $at + b_1 + b_2$
and $\tau^* < s$ equals the probability that the process $X(t + t^*) - X(t^*)$ ever
touches the line $at + b_2$ and $\tau^* < s$. By the strong Markov property, the
latter probability is the product $p(a, b_2)P(\tau^* < s)$. Let $s \to \infty$ to get the
result. Therefore, $p(a, b) = e^{-\gamma(a)b}$. Take $t_b^*$ to be the hitting time of
the point $b$. Then
$$p(a, b) = P\Big(\sup_{0 \le t < \infty} (X(t + t_b^*) - X(t_b^*) - (at + at_b^*)) \ge 0\Big).$$

Conditioning on $t_b^*$ yields
$$p(a, b) = Ee^{-a\gamma(a)t_b^*},$$
(see 4.38). Use (13.21) to conclude that
$$p(a, b) = e^{-\sqrt{2a\gamma(a)}\,b},$$
which leads to $2a\gamma(a) = \gamma^2(a)$, or $\gamma(a) = 2a$. Thus
$$p(a, b) = e^{-2ab}. \tag{13.24}$$
The probability,
$$P\left(\sup_t \frac{|X(t)|}{at + b} \ge 1\right),$$
of exiting from the two-sided region $|x| < at + b$ is more difficult to compute.
One way is to first compute $Ee^{-\lambda t^*}$, where $t^*$ is the first time of hitting
$at + b$, and then imitate the development leading to the two-sided boundary
distribution in (13.23). Another method is given in Doob [36].

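The expression for $p(a, b)$ can be used to get the distribution of
$$\sup_{0 \le t \le 1} (X(t) - tX(1))$$
and therefore of $\lim_n P(D_n^+ < x)$. Let $Y(t) = X(t) - tX(1)$. Then $Y(t)$ is
a Gaussian process with covariance
$$EY(t)Y(s) = s(1 - t), \quad 0 \le s \le t \le 1.$$
Consider the process
$$X^{(1)}(t) = (1 + t)\,Y\left(\frac{t}{1 + t}\right), \quad t \ge 0.$$

Its covariance is $\min(s, t)$, so $X^{(1)}(t)$ is normalized Brownian motion.
Therefore
$$\begin{aligned}
P\Big(\sup_{0 \le t \le 1} Y(t) \ge y\Big) &= P\left(\sup_{t \ge 0} \frac{X^{(1)}(t)}{1 + t} \ge y\right) \\
&= P\Big(\sup_{t \ge 0} (X^{(1)}(t) - yt - y) \ge 0\Big) \\
&= p(y, y) = e^{-2y^2}.
\end{aligned}$$

The limiting distribution for $D_n$ is similarly related to the probability of
exiting from the two-sided region $\{|x| < y(1 + t)\}$.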
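
Together with Theorem 13.16, the formula $e^{-2y^2}$ says that $P(D_n^+ > y)$ should be close to $e^{-2y^2}$ for large $n$. The following Monte Carlo sketch compares the two for uniform samples; the sample size, number of repetitions, seed, and the level $y = 0.8$ are arbitrary choices made here, and the estimate carries both simulation error and a finite-$n$ discrepancy.

```python
import math
import random


def d_plus(n, rng):
    """sqrt(n) * sup_x (F_n(x) - x) for n uniform(0, 1) observations.

    The supremum is approached at the order statistics, where the
    empirical distribution function reaches the height k / n.
    """
    u = sorted(rng.random() for _ in range(n))
    return math.sqrt(n) * max((k + 1) / n - u[k] for k in range(n))


rng = random.Random(3)
n, reps, y = 400, 3000, 0.8
hits = sum(1 for _ in range(reps) if d_plus(n, rng) > y)
estimate = hits / reps
limit = math.exp(-2.0 * y * y)
print(round(estimate, 3), round(limit, 3))
```

By Proposition 13.14 the same comparison applies to samples from any continuous distribution function, once $D_n^+$ is computed against that function.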


Problems
5. Assuming
$$E \exp[-\lambda t_x^*] = \exp[-c\sqrt{\lambda}\,|x|],$$
find $Ee^{-\lambda t^*}$, where $t^* = t^*(-1, +1)$. Differentiating this with respect to $\lambda$,
at $\lambda = 0$, find an expression for $Et^*$ and compare this with the known value
of $Et^*$ to show that $c = \sqrt{2}$.


6. Use Wald's identity (see Problem 1) to get (13.23(1) and (2)) by using
the equations for $\lambda$ and $-\lambda$.


7. Using $E \exp[-\lambda t_x^*] = \exp[-\sqrt{2\lambda}\,|x|]$, prove that for $x > 0$,
$$P\Big(\sup_{0 \le \tau \le t} X(\tau) > x\Big) = 2P(X(t) > x).$$
8. For $S_1, S_2, \ldots$ sums of independent, identically distributed random
variables with zero means and finite second moments, find normalizing
constants so that the following random variables converge in distribution
to a nondegenerate limit, and evaluate the distribution of the limit, or the
Laplace transform of the limit distribution:

a) $\max(S_1, \ldots, S_n)$; b) $\max(|S_1|, \ldots, |S_n|)$; c) $S_1 + \cdots + S_n$.


8. THE LAW OF THE ITERATED LOGARITHM 


Let $S_n$, $n = 1, 2, \ldots$, be successive sums of independent, identically
distributed random variables $Y_1, Y_2, \ldots$, with
$$EY_1 = 0, \quad EY_1^2 = \sigma^2 < \infty.$$
One version of the law of the iterated logarithm is


Theorem 13.25
$$\limsup_n \frac{S_n}{\sqrt{2\sigma^2 n \log(\log n)}} = 1 \quad \text{a.s.}$$

Strassen [134] noted recently that even though this is a strong limit theorem,
it follows from an invariance principle, and therefore is a distant consequence
of the central limit theorem. The result follows fairly easily from the
representation theorem, 13.6. What we need is

Theorem 13.26. There is a probability space with a Brownian motion $X(t)$
defined on it and a sequence $\tilde{S}_n$, $n = 1, \ldots$, having the same distribution as
$S_n/\sigma$, $n = 1, \ldots$, such that
$$\frac{\tilde{S}_{[t]} - X(t)}{\sqrt{2t \log(\log t)}} \to 0 \quad \text{a.s.} \tag{13.27}$$
as $t \to \infty$.


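Proof. By 13.8, there is a sequence of independent, identically distributed, nonnegative random variables $T_1, T_2, \ldots$, $ET_1 = 1$, such that $X(T_1 + \cdots + T_n)$, $n = 1, 2, \ldots$, has the same distribution as $S_n/\sigma$, $n = 1, \ldots$ Therefore (13.27) reduces to proving that

$$\frac{X(T_1 + \cdots + T_{[t]}) - X(t)}{\varphi(t)} \to 0 \quad \text{a.s.},$$

where $\varphi(t) = \sqrt{2t \log(\log t)}$. By the law of large numbers,

$$\frac{T_1 + \cdots + T_{[t]}}{t} \to 1 \quad \text{a.s.}$$

For any $\varepsilon > 0$, there is an almost surely finite function $t_0(\omega)$ such that for $t \ge t_0(\omega)$,

$$T_1 + \cdots + T_{[t]} \in \left[\frac{t}{1 + \varepsilon},\; t(1 + \varepsilon)\right].$$

Let

$$M(t) = \sup_{t/(1+\varepsilon) \le \tau \le t(1+\varepsilon)} |X(\tau) - X(t)|.$$

Thus, for $t_k = (1 + \varepsilon)^k$, and $t_k \le t \le t_{k+1}$,

$$M(t) \le \sup_{t_{k-1} \le \tau \le t_{k+2}} |X(\tau) - X(t)| \le 2 \sup_{t_{k-1} \le \tau \le t_{k+2}} |X(\tau) - X(t_{k-1})|.$$

In consequence, if we define

$$M_k = \sup_{t_{k-1} \le \tau \le t_{k+2}} |X(\tau) - X(t_{k-1})|,$$

then

$$\limsup_t \frac{M(t)}{\varphi(t)} \le 2 \limsup_k \frac{M_k}{\varphi(t_k)}.$$

Also,

$$P(M_k > x) \le 2P\bigl(|X(t_{k+2}) - X(t_{k-1})| > x\bigr).$$

Write $t_{k+2} - t_{k-1} = \delta t_k$, where $\delta = (1 + \varepsilon)^2 - (1 + \varepsilon)^{-1}$. Then

$$\frac{|X(t_{k+2}) - X(t_{k-1})|}{\sqrt{t_{k+2} - t_{k-1}}}$$

is the absolute value of an $N(0, 1)$ variable $Z$. By 12.20, for $k$ large,

$$P\bigl(M_k > \sqrt{2\delta}\,\varphi(t_k)\bigr) \le 2P\bigl(|Z| > 2\sqrt{\log(\log t_k)}\bigr) \le \frac{2\sqrt{2}}{\sqrt{\pi}} \exp[-2\log(\log t_k)] = \frac{2\sqrt{2}}{\sqrt{\pi}} \cdot \frac{1}{k^2 (\log(1 + \varepsilon))^2}.$$

Use Borel-Cantelli again, getting

$$P\bigl(M_k > \sqrt{2\delta}\,\varphi(t_k) \text{ i.o.}\bigr) = 0$$

or

$$\limsup_k \frac{M_k}{\varphi(t_k)} \le \sqrt{2\delta} \quad \text{a.s.}$$

Going back,

$$\limsup_t \frac{|\tilde S_{[t]} - X(t)|}{\varphi(t)} \le \sqrt{8\delta} \quad \text{a.s.}$$

Taking $\varepsilon \downarrow 0$ gives $\delta \to 0$, which completes the proof.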
9. A MORE GENERAL INVARIANCE THEOREM


The direction in which generalization is needed is clear. Let the $\{Y^{(n)}(t)\}$, $t \in [0, 1]$, $n = 0, 1, \ldots$ be a sequence of processes such that $Y^{(n)}(\cdot) \xrightarrow{D} Y^{(0)}(\cdot)$ in the sense that all finite dimensional distributions converge to the appropriate limit. Suppose that all sample functions of $\{Y^{(n)}(t)\}$ are in $D$. Suppose also that some metric $\rho(x(\cdot), y(\cdot))$ is defined on $D$, and that in this metric $H$ is a function on $D$ a.s. continuous with respect to the distribution of $Y^{(0)}(\cdot)$. Find conditions to ensure that

$$H(Y^{(n)}(\cdot)) \xrightarrow{D} H(Y^{(0)}(\cdot)).$$

This has been done for some useful metrics and we follow Skorokhod's strategy. The basic idea is similar to that in our previous work: Find processes $\tilde Y^{(n)}(\cdot)$, $n = 0, 1, \ldots$ defined on a common probability space such that for each $n$, $\tilde Y^{(n)}(\cdot)$ has the same distribution as $Y^{(n)}(\cdot)$, and has all its sample functions in $D$. Suppose the $\tilde Y^{(n)}(\cdot)$ have the additional property that

$$\rho(\tilde Y^{(n)}(\cdot), \tilde Y^{(0)}(\cdot)) \xrightarrow{P} 0.$$

Then conclude, as in Section 5, that if $H(Y^{(n)}(\cdot))$ and $H(Y^{(0)}(\cdot))$ are random variables,

$$H(Y^{(n)}(\cdot)) \xrightarrow{D} H(Y^{(0)}(\cdot)).$$
The basic tool is a construction that yields the very general

Theorem 13.28. Let $\{Y^{(n)}(t)\}$, $t \in [0, 1]$, $n = 0, 1, \ldots$ be any sequence of processes such that $Y^{(n)}(\cdot) \xrightarrow{D} Y^{(0)}(\cdot)$. Then for any countable set $T \subset [0, 1]$, there are processes $\{\tilde Y^{(n)}(t)\}$, $t \in T$, defined on a common space such that

a) For each $n$, $\{\tilde Y^{(n)}(t)\}$, $t \in T$, and $\{Y^{(n)}(t)\}$, $t \in T$, have the same distribution.

b) For every $t \in T$,

$$\tilde Y^{(n)}(t) \to \tilde Y^{(0)}(t) \quad \text{a.s.}$$

as $n \to \infty$.

Proof. The proof of this is based on some simple ideas but is filled with technical details. We give a very brief sketch and refer to Skorokhod [124] for a complete proof.


First, show that a single sequence of random variables $X_n \xrightarrow{D} X_0$ can be replaced by $\tilde X_n \to \tilde X_0$ a.s. on a common space with $\tilde X_n \overset{D}{=} X_n$. It is a bit surprising to go from convergence in distribution to a.s. convergence. But if, for example, the $X_n$ are fair coin-tossing random variables such that $X_n \xrightarrow{D} X_0$, then replace all $X_n$ by the random variables $\tilde X_n$ on $((0, 1), \mathcal{B}((0, 1)), dx)$ defined by

$$\tilde X_n(x) = \begin{cases} -1, & x \in (0, \tfrac{1}{2}], \\ +1, & x \in (\tfrac{1}{2}, 1). \end{cases}$$

Not only does $\tilde X_n(x) \to \tilde X_0(x)$, but $\tilde X_n(x) = \tilde X_0(x)$. In general, take $(\Omega, \mathcal{F}, P) = ((0, 1), \mathcal{B}((0, 1)), dx)$. The device is simple: If $F_n(z)$, the distribution function of $X_n$, is continuous with a unique inverse, then take

$$\tilde X_n(x) = F_n^{-1}(x).$$

Consequently,

$$P(\tilde X_n(x) < z) = l\{x;\ F_n^{-1}(x) < z\} = l\{x;\ x < F_n(z)\} = F_n(z).$$

Since $X_n \xrightarrow{D} X_0$, $F_n(z) \to F_0(z)$ for every $z$; thus $F_n^{-1}(x) \to F_0^{-1}(x)$, all $x$, or $\tilde X_n(x) \to \tilde X_0(x)$. Because $F_n$ may not have a unique inverse, define

$$\tilde X_n(x) = \inf\{y;\ F_n(y) > x\}.$$

Now verify that these variables do the job.
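The quantile device above can be sketched for discrete laws as follows. This is an illustration under stated assumptions, not the text's construction in full generality: the helper names are hypothetical, and the example sequence (a two-point law whose weight on $-1$ is $1/2 + 1/(n+2)$, converging to a fair coin at $n = 0$) is chosen only for demonstration.

```python
def quantile(values, probs, x):
    """Generalized inverse inf{y : F(y) > x} for a discrete law on sorted `values`."""
    cum = 0.0
    for v, p in zip(values, probs):
        cum += p  # cum is F(v), the distribution function at the atom v
        if cum > x:
            return v
    return values[-1]  # guards against floating-point rounding when x is near 1

def coupled_variable(n, x):
    """Value at the sample point x in (0, 1) of the coupled variable with index n.

    Assumed example: P(X_n = -1) = 1/2 + 1/(n + 2) for n >= 1, and X_0 is a fair coin.
    """
    p_minus = 0.5 + (1.0 / (n + 2) if n > 0 else 0.0)
    return quantile([-1, 1], [p_minus, 1.0 - p_minus], x)
```

For a fixed sample point such as `x = 0.6`, `coupled_variable(n, x)` equals `-1` for small `n` and agrees with `coupled_variable(0, x)` once `1/(n + 2)` drops below `0.1`, which is the pointwise convergence the construction is designed to produce.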

Generalize now to a sequence of processes $\mathbf{X}_n = (X_1^{(n)}, \ldots)$ such that $\mathbf{X}_n \xrightarrow{D} \mathbf{X}_0$. Suppose we have a nice 1-1 mapping $\theta: R^{(\infty)} \leftrightarrow B$, $B \in \mathcal{B}_1$, such that $\theta$, $\theta^{-1}$ are measurable $\mathcal{B}_\infty$, $\mathcal{B}_1(B)$ respectively, and such that the following holds:

$$\theta(\mathbf{X}_n) \xrightarrow{D} \theta(\mathbf{X}_0).$$

Take $\tilde Y_n$, $n \ge 0$, random variables on a common space such that $\tilde Y_n \overset{D}{=} \theta(\mathbf{X}_n)$ and $\tilde Y_n \to \tilde Y_0$ a.s. Define $\tilde{\mathbf{X}}_n = \theta^{-1}(\tilde Y_n)$. It is easy to see that $\tilde{\mathbf{X}}_n$ and $\mathbf{X}_n$ have the same distribution. If $\theta$ is smooth enough so that $\tilde Y_n \to \tilde Y_0$ a.s. implies that every coordinate of $\theta^{-1}(\tilde Y_n)$ converges a.s. to the corresponding coordinate of $\theta^{-1}(\tilde Y_0)$, then this does it. To get such a $\theta$, let $C_{n,k}$ be the set $\{x;\ P(X_k^{(n)} = x) > 0\}$. Let $C = \bigcup_{n,k} C_{n,k}$; $C$ is countable. Take $\varphi(x): R^{(1)} \leftrightarrow (0, 1)$ to be 1-1 and continuous such that $\varphi(C)$ contains no binary rationals. There is a 1-1 measurable mapping $f: (0, 1)^{(\infty)} \leftrightarrow (0, 1)$ constructed in Appendix A.47. The mapping $\theta: R^{(\infty)} \leftrightarrow (0, 1)$ defined by

$$\theta(x_1, x_2, \ldots) = f(\varphi(x_1), \varphi(x_2), \ldots)$$

has all the necessary properties.

Take one more step. The process $\{\tilde X^{(n)}(t)\}$, $t \in T$, having the same distribution as $\{X^{(n)}(t)\}$, $t \in T$, has the property that a.s. every sample function is the restriction to $T$ of a function in $D$. Take $T$ dense in $[0, 1]$, and for any $t \in [0, 1]$, $t \notin T$, define

$$\tilde X^{(n)}(t) = \lim_k \tilde X^{(n)}(t_k), \qquad t_k \in T,\ t_k \downarrow t.$$

[Assume $\{1\} \in T$.] The processes $\{\tilde X^{(n)}(t)\}$, $t \in [0, 1]$, defined this way have all their sample functions in $D$, except perhaps for a set of probability zero. Furthermore, they have the same distribution as $X^{(n)}(\cdot)$ for each $n$.

Throwing out a set of probability zero, we get the statement: For each $\omega$ fixed, the sample paths $x_n(t) = \tilde X^{(n)}(t, \omega)$ are in $D$ with the property that $x_n(t) \to x_0(t)$ for all $t \in T$. The extra condition needed is something to guarantee that this convergence on $T$ implies that $\rho(x_n(\cdot), x_0(\cdot)) \to 0$ in the metric we are using. To illustrate this, use the metric

$$\rho(x(\cdot), y(\cdot)) = \sup_{0 \le t \le 1} |x(t) - y(t)|,$$

introduced in the previous sections. Other metrics will be found in Skorokhod's article referred to above. Define

$$\delta_n(h) = \sup_{|t - \tau| \le h} |x_n(t) - x_n(\tau)|, \qquad \delta(h) = \limsup_n \delta_n(h).$$

If $x_n(t) \to x_0(t)$ for $t \in T$, and $x_0(t) \in D$, then $\lim_{h \downarrow 0} \delta(h) = 0$ implies $\rho(x_n, x_0) \to 0$. Hence the following:


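Theorem 13.29. Under the above assumptions, if

$$\lim_{h \downarrow 0} \limsup_n P\Bigl(\sup_{|t - \tau| \le h} |Y^{(n)}(t) - Y^{(n)}(\tau)| > \varepsilon\Bigr) = 0, \tag{13.30}$$

then

$$\sup_{0 \le t \le 1} |\tilde Y^{(n)}(t) - \tilde Y^{(0)}(t)| \xrightarrow{P} 0.$$

Proof. $Y^{(0)}(t)$ has continuous sample paths a.s. To see this, take $T_m = \{t_1, \ldots, t_m\} \subset T$. Then, letting

$$M_n = \sup_{|t_i - t_j| \le h} |Y^{(n)}(t_i) - Y^{(n)}(t_j)|,$$

we find that $M_n \xrightarrow{D} M_0$ follows. For $\varepsilon$ a continuity point in the distribution of $M_0$, $P(M_n > \varepsilon) \to P(M_0 > \varepsilon)$. Therefore,

$$P(M_0 > \varepsilon) \le \limsup_n P\Bigl(\sup_{|t - \tau| \le h} |Y^{(n)}(t) - Y^{(n)}(\tau)| > \varepsilon\Bigr).$$

Letting $T_m \uparrow T$ and using (13.30) implies the continuity.

Define random variables

$$\delta_n(h) = \sup_{|t - \tau| \le h} |\tilde Y^{(n)}(t) - \tilde Y^{(n)}(\tau)|,$$

$$U_n = \sup_{0 \le t \le 1} |\tilde Y^{(n)}(t) - \tilde Y^{(0)}(t)|.$$

If $T_m = \{t_1, \ldots, t_m\} \subset T$ is such that

$$|t_{k+1} - t_k| \le h, \qquad k = 1, \ldots, m - 1,$$

then

$$U_n \le \sup_{t \in T_m} |\tilde Y^{(n)}(t) - \tilde Y^{(0)}(t)| + \delta_n(h) + \delta_0(h).$$

The first term goes to zero a.s., leaving

$$\limsup_n P(U_n > \varepsilon) \le \limsup_n P(\delta_n(h) > \varepsilon/3) + P(\delta_0(h) > \varepsilon/3).$$

Take $h \downarrow 0$. Since $U_n$ does not depend on $h$, the continuity of $Y^{(0)}(\cdot)$ and (13.30) yield

$$\lim_n P(U_n > \varepsilon) = 0 \;\Rightarrow\; U_n \xrightarrow{P} 0.$$

Remark. Note that under (13.30), since $\tilde Y^{(0)}(\cdot)$ has continuous sample paths a.s., $Y^{(0)}(t)$ has a version with all sample paths continuous.

The general theorems are similar; $Y^{(n)}(\cdot) \xrightarrow{D} Y^{(0)}(\cdot)$ plus some equicontinuity condition on the $Y^{(n)}(\cdot)$ gives

$$\rho(\tilde Y^{(n)}, \tilde Y^{(0)}) \xrightarrow{P} 0.$$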
Problem 9. For random variables having a uniform distribution on $[0, 1]$, and $\hat F_n(x)$, the sample distribution function defined in Section 6, use the multidimensional central limit theorem to show that

$$\{\sqrt{n}\,(\hat F_n(\xi) - \xi)\} \xrightarrow{D} \{Y(\xi)\}, \qquad \xi \in [0, 1],$$

where $Y(\xi)$ is a Gaussian process with covariance

$$EY(\eta)Y(\xi) = \eta(1 - \xi), \qquad 0 \le \eta \le \xi \le 1.$$

Prove that (13.30) is satisfied by using 13.15 and the Skorokhod lemma 3.21.


NOTES 


The invariance principle as applied to sums of independent, identically distributed random variables first appeared in the work of Erdős and Kac [47, 1946] and [48, 1947]. The more general result of 13.12 is due to Donsker [30, 1951]. The method of imitating the sums by using a Brownian motion evaluated at random times was developed by Skorokhod [126, 1961]. The possibility of using these methods on the Kolmogorov-Smirnov statistics was suggested by Doob [37] in a paper where he also evaluates the distribution of the limiting functionals on the Brownian motion. Donsker later [31, 1952] proved that Doob's suggested approach could be made rigorous. For some interesting material on the distribution of various functionals on Brownian motion, see Cameron and Martin [13], Kac [82], and Dinges [24].

Strassen's recent work [134, 1964] on the law of the iterated logarithm contains some fascinating generalizations of this law concerning the limiting fluctuations of Brownian motion. A relatively simple proof of the law of the iterated logarithm for coin-tossing is given by Feller [59, Vol. I]. A generalized version proved by Erdős [46, 1942] for coin-tossing, and extended by Feller [53, 1943] is: Let $\varphi(n)$ be a positive, monotonically increasing function, $S_n = Y_1 + \cdots + Y_n$, $Y_1, \ldots$, independent and identically distributed random variables with mean zero and finite second moment. Then

$$P\left(\frac{S_n}{\sigma\sqrt{n}} > \varphi(n) \text{ i.o.}\right)$$

equals zero or one, depending on whether

$$\sum_n \frac{\varphi(n)}{n}\, e^{-\varphi^2(n)/2}$$

converges or diverges.
The general question concerning convergence of a sequence of processes $X^{(n)}(\cdot) \xrightarrow{D} X(\cdot)$ and related invariance results was dealt with in 1956 by Prokhorov [118] and by Skorokhod [124]. We followed the latter in Section 9.

The arc sine law has had an honorable history. Its importance in 
probability has been not so much in the theorem itself, as in the variety and 
power of the methods developed to prove it. For Brownian motion, it was 
derived by Paul Lévy [104, 1939]. Then Erdős and Kac [48, 1947] used an
invariance argument to get it for sums of independent random variables. 
Then Sparre Andersen in 1954 [128] discovered a combinatorial proof that 
revealed the surprising fact that the law held for random variables whose 
second moments were not necessarily finite. Spitzer extended the combina- 
torial methods into entirely new areas [129, 1956]. For the latter, see par- 
ticularly Spitzer’s book [130], also the development by Feller [59, Vol. II]. 
Another interesting proof was given by Kac [82] for Brownian motion as a 
special case of a method that reduces the finding of distribution of functionals 
to related differential equations. There are at least three more proofs we 
know of that come from other areas of probability. 


CHAPTER 14 


MARTINGALES AND PROCESSES WITH 
STATIONARY, INDEPENDENT INCREMENTS 


1. INTRODUCTION 


In Chapter 12, Brownian motion was defined as follows: 


1) $X(t + \tau) - X(t)$ is independent of everything up to time $t$,
2) The distribution of $X(t + \tau) - X(t)$ depends only on $\tau$,


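3) $\lim_{\Delta t \downarrow 0} \dfrac{1}{\Delta t} P(|X(t + \Delta t) - X(t)| \ge \delta) = 0$ for every $\delta > 0$.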


The third assumption involved continuity and had the eventual consequence 
that a version of Brownian motion was available with all sample paths 
continuous. 

If the third assumption is dropped, then we get a class of processes 
satisfying (1) and (2) which have the same relation to Brownian motion as 
the infinitely divisible laws do to the normal law. In fact, examining these 
processes gives much more meaning to the representation for characteristic 
functions of infinitely divisible laws. 

These processes cannot have versions with continuous sample paths, 
otherwise the argument given in Chapter 12 forces them to be Brownian 
motion. Therefore, the extension problem that plagued us there and that 
we solved by taking a continuous version, comes back again. We deal with 
this problem in the same way—we take the smoothest possible version 
available. Of the results available relating to smoothness of sample paths, 
one of the most general is for continuous parameter martingale processes. 
So first we develop the martingale theorems. With this theory in hand, we 
then prove that there are versions of any of the processes satisfying (1) and 
(2) above, such that all sample paths are continuous except for jumps. Then 
we investigate the size and number of jumps in terms of the distribution
of the process, and give some applications.


2. THE EXTENSION TO SMOOTH VERSIONS 


Virtually all the well-known stochastic processes $\{X(t)\}$, $t \in I$, can be shown to have versions such that all sample paths have only jump discontinuities. That is, the sample paths are functions $x(t)$ which have finite right- and left-hand limits $x(t+)$ and $x(t-)$ at all $t \in I$ for which these limits can be defined. This last phrase refers to endpoints. Make the convention that if $t$ is in the interior of $I$, both $x(t-)$ and $x(t+)$ limits can be defined. At a closed right endpoint, only the $x(t-)$ limit can be defined. At a closed left endpoint only the $x(t+)$ limit can be defined. At open (including infinite) endpoints, neither limit can be defined. We specialize a bit more and define:


Definition 14.1. D(I) is the class of all functions x(t), t € I, which have only 
jump discontinuities and which are right-continuous; that is, x(t+) = x(t) for 
all t € I such that x(t+) is defined. 


Along with this goes 


Definition 14.2. A process $\{X(t)\}$, $t \in I$, will be called continuous in probability from the right if whenever $\tau \downarrow t$,

$$X(\tau) \xrightarrow{P} X(t).$$

We want to find conditions on the process $\{X(t)\}$, $t \in I$, so that a version exists with all sample paths in $D(I)$. As with Brownian motion, start by considering the variables of the process on a set $T$ countable and dense in $I$, with the convention that $T$ includes any closed endpoints of $I$. In the case of continuous sample paths the essential property was that for $I$ finite, any function defined and uniformly continuous on $T$ had an extension to a continuous function on $I$. The analog we need here is


Definition 14.3. A function $x(t)$ defined on $T$ is said to have only jump discontinuities in $I$ if the limits

$$\lim_{s \uparrow t} x(s),\ s \in T, \qquad \lim_{s \downarrow t} x(s),\ s \in T$$

exist and are finite for all $t \in I$ where these limits can be defined.

Proposition 14.4. If $x(t)$ defined on $T$ has only jump discontinuities on $I$, then the function $\tilde x(t)$ defined on $I$ by

$$\tilde x(t) = \lim_{s \downarrow t} x(s), \qquad s \in T,$$

and $\tilde x(b) = x(b)$ for $b$ a closed right endpoint of $I$ is in $D(I)$.

Proof. Let $t_n \downarrow t$, $t_n, t \in I$, and take $s_n \in T$, $s_n > t_n > t$ such that $s_n \downarrow t$ and $x(s_n) - \tilde x(t_n) \to 0$. Since $x(s_n) \to \tilde x(t)$, this implies $\tilde x(t+) = \tilde x(t)$. Now take $t_n \uparrow t$, and $s_n \in T$ with $t_n < s_n < t$ and $\tilde x(t_n) - x(s_n) \to 0$. This shows that $\tilde x(t_n) \to \lim_{s \uparrow t} x(s)$, $s \in T$.
ste 


We use this to get conditions for the desired version. 


Theorem 14.5. Let the process $\{X(t)\}$, $t \in I$, be continuous in probability from the right. Suppose that almost every sample function of the countable process $\{X(t)\}$, $t \in T$, has only jump discontinuities on $I$. Then there is a version of $\{X(t)\}$, $t \in I$, with all sample paths in $D(I)$.

Proof. If for fixed $\omega$, $\{X(t, \omega)\}$, $t \in T$, does not have only jump discontinuities on $I$, put $\tilde X(t, \omega) = 0$, all $t \in I$. Otherwise, define

$$\tilde X(t, \omega) = \lim_{s \downarrow t} X(s, \omega), \qquad s \in T,$$

and $\tilde X(b, \omega) = X(b, \omega)$ for $b$ a closed right endpoint of $I$. By 14.4, the process $\{\tilde X(t)\}$, $t \in I$, so defined has all its sample paths in $D(I)$. For any $t \in I$ such that $s_n \downarrow t$, $s_n \in T$, $\tilde X(t) = \lim_n X(s_n)$ a.s. By the continuity in probability from the right, $X(s_n) \xrightarrow{P} X(t)$. Hence

$$\tilde X(t) = X(t) \quad \text{a.s.},$$

completing the proof.
Problem 1. For $x(t) \in D(I)$, $J$ any finite closed subinterval of $I$, show that

1) $\sup_{t \in J} |x(t)| < \infty$,

2) for any $\delta > 0$, the set $\{t \in J;\ |x(t) - x(t-)| > \delta\}$ is finite,

3) the set of discontinuity points of $x(t)$ is at most countable.


3. CONTINUOUS PARAMETER MARTINGALES 
Definition 14.6. A process $\{X(t)\}$, $t \in I$, is called a martingale (MG) if $E|X(t)| < \infty$, for all $t \in I$, and if for all $s, t \in I$, $s \le t$,

$$E(X(t) \mid X(\tau),\ \tau \le s) = X(s) \quad \text{a.s.}$$

Call the process a submartingale (SMG) if under the same conditions

$$E(X(t) \mid X(\tau),\ \tau \le s) \ge X(s) \quad \text{a.s.}$$

This definition is clearly the immediate generalization of the discrete parameter case. The basic sample path property is:


Theorem 14.7. Let {X(t)}, t € I be a SMG. Then for T dense and countable 
in I, almost every sample function of {X(t)}, t € T, has only jump discontinuities 
on I. 


Proof. It is sufficient to prove this for $I$ a finite, closed interval $[r, \tau]$. Define

$$X^-(t-) = \liminf_{s \uparrow t} X(s), \quad X^+(t-) = \limsup_{s \uparrow t} X(s), \qquad s \in T, \tag{14.8}$$

$$X^-(t+) = \liminf_{s \downarrow t} X(s), \quad X^+(t+) = \limsup_{s \downarrow t} X(s), \qquad s \in T.$$

Of course, the limits for $r-$ and $\tau+$ are not defined. First we show that for almost every sample function the limits in (14.8) are finite for all $t \in I$. In fact,

$$\sup_{t \in T} |X(t)| < \infty \quad \text{a.s.} \tag{14.9}$$

To show this, take $T_N$ finite subsets of $T$, $T_N \uparrow T$ and $r, \tau \in T_N$. By adding together both (1) and (2) of 5.13, deduce that

$$P\Bigl(\sup_{t \in T_N} |X(t)| > x\Bigr) \le \frac{1}{x}\bigl(2E|X(\tau)| - EX(r)\bigr). \tag{14.10}$$

Letting $N \to \infty$ proves that (14.10) holds with $T$ in place of $T_N$. Now take $x \to \infty$ to prove (14.9).

Now assume all limits in (14.8) are finite. If a sample path of $\{X(t)\}$, $t \in T$, does not have only jump discontinuities on $I$, then there is a point $v \in I$ such that either $X^-(v-) < X^+(v-)$ or $X^-(v+) < X^+(v+)$. For any two numbers $a < b$, let $D(a, b)$ be the set of all $\omega$ such that there exists a $v \in I$ with either

$$X^-(v-) < a < b < X^+(v-) \quad \text{or} \quad X^-(v+) < a < b < X^+(v+).$$

The union $\bigcup D(a, b)$ over all rational $a, b$; $a < b$, is then the set of all sample paths not having only jump discontinuities.

Take $T_N$ finite subsets of $T$ as above. Let $\beta_N$ be the number of up-crossings of the interval $[a, b]$ by the SMG sequence $\{X(t_j)\}$, $t_j \in T_N$ (see Section 4, Chapter 5). Then $\beta_N \uparrow \beta$, where $\beta$ is a random variable, possibly extended. The significant fact is

$$D(a, b) \subset \{\beta = \infty\}.$$

Apply Lemma 5.17 to get

$$E\beta_N \le \frac{E(X(\tau) - a)^+}{b - a}$$

to conclude that $E\beta < \infty$, hence $P(D(a, b)) = 0$. Q.E.D.


The various theorems concerning transformation of martingales by 
optional sampling and stopping generalize, if appropriate restrictions are 
imposed. See Doob [39, Chap. 7] for proofs under weak restrictions. We 
assume here that all the processes we work with have a version with all 
sample paths in D(I).

Proposition 14.11. Let t* be a stopping time for a process {X(t)}, t € I, having 
all sample paths right-continuous. Then X(t*) is a random variable. 


Proof. Approximate $t^*$ by

$$t_n^* = \begin{cases} k/n, & (k - 1)/n < t^* \le k/n,\ k > 1, \\ 1/n, & 0 \le t^* \le 1/n,\ k = 1. \end{cases}$$

Then $X(t_n^*)$ is a random variable. For $n$ running through $2^m$, $t_n^* \downarrow t^*$, so by right-continuity, $X(t_n^*) \to X(t^*)$ for every $\omega$.
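The discretization in this proof is easy to see numerically. The sketch below is illustrative only: the function name is hypothetical, and it treats $t^*$ as a fixed number for one sample point rather than as a random variable.

```python
import math

def approx_from_above(t_star, n):
    """Return k/n where (k-1)/n < t_star <= k/n, or 1/n when 0 <= t_star <= 1/n."""
    k = max(1, math.ceil(t_star * n))
    return k / n

def dyadic_approximations(t_star, max_power):
    """Approximations along n = 2, 4, ..., 2**max_power; they decrease to t_star."""
    return [approx_from_above(t_star, 2 ** m) for m in range(1, max_power + 1)]
```

Along the dyadic subsequence each grid refines the previous one, which is why the approximations are monotone; along all integers `n` they still stay above `t_star` within `1/n` but need not decrease.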


We prove, as an example, the generalization of 5.31. 


Theorem 14.12. Let $t^*$ be a stopping time for the SMG (MG) $\{X(t)\}$, $t \in [0, \infty)$. If all the paths of the process are in $D([0, \infty))$ and if

a) $E|X(t^*)| < \infty$,

b) $\liminf_{t \to \infty} \int_{\{t^* > t\}} |X(t)|\, dP = 0$,

then

$$EX(t^*) \underset{(=)}{\ge} EX(0).$$

Proof. Suppose first that $t^*$ is uniformly bounded, $t^* \le \tau$. Take $t_n^* \downarrow t^*$, $t_n^* \le \tau$, but $t_n^*$ taking on only a finite number of values. By 5.31,

$$EX(t_n^*) \underset{(=)}{\ge} EX(0),$$

and right-continuity implies $X(t_n^*) \to X(t^*)$ everywhere. Some sort of boundedness condition is needed now to conclude $EX(t_n^*) \to EX(t^*)$. Uniform integrability is sufficient, that is,

$$\sup_n \int_{\{|X(t_n^*)| > x\}} |X(t_n^*)|\, dP$$

goes to zero as $x \to \infty$. If $\{X(t)\}$ is a MG process, then $\{|X(t)|\}$ is a SMG. Hence by the optional sampling theorem, 5.10,

$$\int_{\{|X(t_n^*)| > x\}} |X(t_n^*)|\, dP \le \int_{\{|X(t_n^*)| > x\}} |X(\tau)|\, dP. \tag{14.13}$$

Let

$$M = \sup_{0 \le t \le \tau} |X(t)|.$$

By the right-continuity and (14.9), $M < \infty$ a.s. Then the uniform integrability follows from $E|X(\tau)| < \infty$ and

$$\int_{\{|X(t_n^*)| > x\}} |X(\tau)|\, dP \le \int_{\{M > x\}} |X(\tau)|\, dP.$$

But if $\{X(t)\}$ is not a MG, then use this argument: If the SMG $\{X(t)\}$, $t \ge 0$, were bounded below, say, $X(t) \ge \alpha$, all $t \le \tau$ and $\omega$, then for $x > |\alpha|$,

$$\int_{\{|X(t_n^*)| > x\}} |X(t_n^*)|\, dP = \int_{\{X(t_n^*) > x\}} X(t_n^*)\, dP \le \int_{\{X(t_n^*) > x\}} X(\tau)\, dP \le \int_{\{X(t_n^*) > x\}} |X(\tau)|\, dP.$$

This gets us to (14.13) again. Proceed as above to conclude $EX(t^*) \ge EX(0)$. In general, for $\alpha$ negative, take $Y(t) = \max(\alpha, X(t))$. Then $\{Y(t)\}$, $t \ge 0$, is a SMG bounded below, so $EY(t^*) \ge EY(0)$. Take $\alpha \to -\infty$ and note that

$$EX(t^*) - EY(t^*) = \int_{\{X(t^*) < \alpha\}} (X(t^*) - \alpha)\, dP$$

goes to zero. Similarly for $EX(0) - EY(0)$, proving the theorem for bounded stopping times. If $t^*$ is not bounded, define the stopping time $t^{**}$ as $\min(t^*, \tau)$. Then

$$E|X(t^{**})| \le E|X(t^*)| + E|X(\tau)| < \infty,$$

so $EX(t^{**}) \ge EX(0)$. But

$$EX(t^*) - EX(t^{**}) = \int_{\{t^* > \tau\}} (X(t^*) - X(\tau))\, dP.$$

The first term in this integral goes to zero as $\tau \to \infty$ because $E|X(t^*)| < \infty$ by hypothesis. For the second term, simply take a sequence $\tau_n \to \infty$ such that

$$\int_{\{t^* > \tau_n\}} |X(\tau_n)|\, dP \to 0.$$

For this sequence, with $t_n^{**} = \min(t^*, \tau_n)$, $EX(t_n^{**}) \to EX(t^*)$, completing the proof.


Problem 2. For Brownian motion $\{X(t)\}$, $t \ge 0$, and $t^* = t^*(a, b)$, prove using 14.12 that $EX(t^*) = 0$, $EX^2(t^*) = Et^*$.


4. PROCESSES WITH STATIONARY, 
INDEPENDENT INCREMENTS 


Definition 14.14. A process $\{X(t)\}$, $t \in [0, \infty)$, has independent increments if for any $t$ and $\tau \ge 0$, $\mathcal{F}(X(t + \tau) - X(t))$ is independent of $\mathcal{F}(X(s),\ s \le t)$.

The stationary condition is that the distribution of the increase does not depend on the time origin.

Definition 14.15. A process $\{X(t)\}$, $t \in [0, \infty)$, is said to have stationary increments if $\mathcal{L}(X(t + \tau) - X(t))$, $\tau \ge 0$, does not depend on $t$.

In this section we will deal with processes having independent, stationary increments, and we further normalize by taking $X(0) = 0$. Note that

$$X(t) = (X(t) - X(t - \tau)) + (X(t - \tau) - X(t - 2\tau)) + \cdots + (X(\tau) - X(0)),$$

where $\tau = t/n$. Or

$$X(t) = X_1^{(n)} + \cdots + X_n^{(n)},$$

where the $X_k^{(n)}$, $k = 1, \ldots$, are independent and identically distributed. Ergo, $X(t)$ must have an infinitely divisible distribution. Putting this formally, we have the following proposition.

Proposition 14.16. Let $\{X(t)\}$, $t \in [0, \infty)$, be a process with independent, stationary increments; then $X(t)$ has an infinitely divisible distribution for every $t \ge 0$.


It follows from $X(t + s) = (X(t + s) - X(s)) + (X(s) - X(0))$ that $f_t(u)$, the characteristic function of $X(t)$, satisfies the identity

$$f_{t+s}(u) = f_t(u) f_s(u). \tag{14.17}$$

If $f_t(u)$ had any reasonable smoothness properties, such as $f_t(u)$ measurable $\mathcal{B}_1$ in $t$ for each $u$, then (14.17) would imply that $f_t(u) = [f_1(u)]^t$, $t \ge 0$. Unfortunately, a pathology can occur: Let $\varphi(t)$ be a real solution of the equation $\varphi(t + s) = \varphi(t) + \varphi(s)$, $t, s \ge 0$. Nonlinear solutions of this do exist [33]. They are nonmeasurable and unbounded in every interval. Consider the degenerate process $X(t) = \varphi(t)$. This process has stationary, independent increments.

Starting with any process $\{X(t)\}$ such that $f_t(u) = [f_1(u)]^t$, the process $X'(t) = X(t) + \varphi(t)$ also has stationary, independent increments, but $f'_t(u) \ne [f'_1(u)]^t$. This is the extent of the pathology, because it follows from Doob [39, pp. 407 ff.] that if $\{X(t)\}$ is a process with stationary, independent increments, then there is a function $\varphi(t)$, $\varphi(t + s) = \varphi(t) + \varphi(s)$, such that the process $\{X^{(1)}(t)\}$, $X^{(1)}(t) = X(t) - \varphi(t)$, has stationary, independent increments, $X^{(1)}(0) = 0$, and $f_t^{(1)}(u)$ is continuous in $t$ for every $u$. Actually, this is not difficult to show directly in this case (see Problem 3). A sufficient condition that eliminates this unpleasant case is given by


Proposition 14.18. Let $\{X(t)\}$ be a process with stationary, independent increments such that $f_t(u)$ is continuous at $t = 0$ for every $u$; then $\{X(t)\}$ is continuous in probability and $f_t(u) = [f_1(u)]^t$.

Proof. Fix $u$, then taking $s \downarrow 0$ in the equations $f_{t+s}(u) = f_t(u) f_s(u)$ and $f_t(u) = f_{t-s}(u) f_s(u)$ proves $f_t(u)$ continuous for all $t \ge 0$. The only continuous solutions of this functional equation are the exponentials. Therefore $f_t(u) = e^{t\psi(u)}$. Evaluate this at $t = 1$ to get $f_1(u) = e^{\psi(u)}$. Use $\lim_{s \downarrow 0} f_s(u) = 1$ to conclude $X(s) \xrightarrow{D} 0$ as $s \to 0$, implying $X(s) \xrightarrow{P} 0$. Since

$$\mathcal{L}(X(t) - X(t - s)) = \mathcal{L}(X(t + s) - X(t)) = \mathcal{L}(X(s)), \qquad s \ge 0,$$

the proposition follows.

The converse holds.

Proposition 14.19. Given the characteristic function $f(u)$ of an infinitely divisible distribution, there is a unique process $\{X(t)\}$ with stationary, independent increments, $X(0) = 0$, such that $f_t(u) = [f(u)]^t$.


Remark. In the above statement, uniqueness means that every process satisfying the given conditions has the same distribution.

Proof. All that is necessary to prove existence is to produce finite-dimensional consistent distribution functions. To specify the distribution of $X(t_1), X(t_2), \ldots, X(t_n)$, $t_1 < t_2 < \cdots < t_n$, define variables $Y_1, Y_2, \ldots, Y_n$ as being independent, $Y_k$ having characteristic function $[f(u)]^{t_k - t_{k-1}}$, $t_0 = 0$. Define

$$\mathcal{L}(X(t_1), X(t_2), \ldots, X(t_n)) = \mathcal{L}(Y_1,\ Y_1 + Y_2,\ \ldots,\ Y_1 + Y_2 + \cdots + Y_n).$$

The consistency is obvious. By the extension theorem for processes, 12.14, there is a process $\{X(t)\}$, $t \ge 0$, having the specified distributions. For these distributions, $X(t + \tau) - X(t)$ is independent of the vector $(X(t_1), \ldots, X(t_n))$, $t_1, \ldots, t_n \le t$. Thus the process $\{X(t)\}$ has independent increments. Further, the characteristic function of $X(t + \tau) - X(t)$ is $[f(u)]^\tau$, all $t \ge 0$, implying stationarity of the increments. Of course, the characteristic function of $X(t)$ is $[f(u)]^t$. By construction, $X(0) = 0$ a.s. since $f_0(u) = 1$, so we take a version with $X(0) \equiv 0$. If there is any other such process $\{\tilde X(t)\}$ having characteristic function $[f(u)]^t$, clearly its distribution is the same.

Since there is a one-to-one correspondence between processes with stationary, independent increments, continuous in probability, and characteristic functions of infinitely divisible distributions, we add the terminology: If

$$f_t(u) = e^{t\psi(u)},$$

call $\psi(u)$ the exponent function of the process.
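The finite-dimensional construction in the proof of 14.19 translates directly into a sampling recipe: draw independent increments whose laws have characteristic functions $[f(u)]^{t_k - t_{k-1}}$ and take partial sums. The sketch below is an illustration under assumptions not in the text: the function names are hypothetical, and the choice $f(u) = (1 - iu)^{-1}$ (the exponential law, whose $t$-th power is the Gamma law with shape $t$) is just one convenient infinitely divisible example.

```python
import random

def sample_path(times, draw_increment, rng):
    """Return [X(t_1), ..., X(t_n)] for increasing positive times, with X(0) = 0.

    draw_increment(dt, rng) must sample the law whose characteristic
    function is f(u)**dt, so that the partial sums match the proof's Y_1 + ... + Y_k.
    """
    path, x, prev = [], 0.0, 0.0
    for t in times:
        x += draw_increment(t - prev, rng)  # independent increment over (prev, t]
        path.append(x)
        prev = t
    return path

def gamma_increment(dt, rng):
    """Increment law for the assumed example: Gamma(shape=dt, scale=1)."""
    return rng.gammavariate(dt, 1.0)
```

For instance, `sample_path([0.5, 1.0, 2.0], gamma_increment, random.Random(1))` returns one draw of the process at three times; since Gamma increments are positive, the sampled values are nondecreasing.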


Problems 


3. Since $|f_t(u)| \le 1$, all $u$, show that (14.17) implies that $|f_t(u)| = |f_1(u)|^t$. By (9.20), $\log f_t(u) = i\beta(t)u + \int \varphi(x, u)\, \gamma_t(dx)$. Use $|f_t(u)| = |f_1(u)|^t$ to show that $\gamma_t(R^{(1)})$ is uniformly bounded in every finite $t$-interval. Deduce that $\gamma_{t+s} = \gamma_t + \gamma_s$; hence show that $\gamma_t = t\gamma_1$. Now $\beta(t + s) = \beta(t) + \beta(s)$ by the unique representation. Conclude that $X^{(1)}(t) = X(t) - \beta(t)$ has a continuous characteristic function $f_t^{(1)}(u)$. (It is known that every measurable solution of $\varphi(t + s) = \varphi(t) + \varphi(s)$ is linear. Therefore if $\varphi(t)$ is measurable, then it is continuous and $f_t(u)$ is therefore continuous.)

4. For a process with independent increments show that

$\mathcal{F}(X(t + \tau) - X(t))$ independent of $\mathcal{F}(X(s),\ s \le t)$ for each $t, \tau$ $\Rightarrow$ $\mathcal{F}(X(t + \tau) - X(t),\ \tau \ge 0)$ independent of $\mathcal{F}(X(s),\ s \le t)$.

5. For a process with stationary, independent increments, show that for $B \in \mathcal{B}_1$,

1) $P(X(t + \tau) \in B \mid X(s),\ s \le t) = P(X(t + \tau) \in B \mid X(t))$ a.s.,

2) $P(X(t + \tau) \in B \mid X(t) = x) = P(X(\tau) \in B - x)$ a.s. $P(X(t) \in dx)$.
5. PATH PROPERTIES


We can apply the martingale results to get this theorem: 


Theorem 14.20. Let $\{X(t)\}$, $t \ge 0$, be a process with stationary, independent increments continuous in probability. Then there is a version of $\{X(t)\}$ with all sample paths in $D([0, \infty))$.


Remark. There are a number of ways to prove this theorem. It can be done 
directly, using Skorokhod’s lemma (3.21) in much the same way as was done 
for path continuity in Brownian motion. But since we have martingale 
machinery available, we use a device suggested by Doob [39]. 


Proof. Take, as usual, $X(0) = 0$. If $E|X(t)| < \infty$, all $t \ge 0$, we are finished, because, subtracting off the means $EX(t)$ if necessary, we can assume $EX(t) = 0$. Then $\{X(t)\}$, $t \ge 0$, is a martingale since for $0 \le s \le t$,

$$E(X(t) \mid X(\tau),\ \tau \le s) = E(X(t) - X(s) \mid X(\tau),\ \tau \le s) + X(s) = X(s) \quad \text{a.s.}$$

Simply apply the martingale path theorem (14.7), the continuity in probability, and 14.5.

If $E|X(t)|$ is not finite for all $t$, one interesting proof is the following: Take $\varphi(x)$ any continuous bounded function. The process on $[0, \tau]$ defined by

$$Y(t) = E(\varphi(X(\tau)) \mid X(s),\ s \le t)$$

is a martingale. But (see Problem 5),

$$Y(t) = E(\varphi(X(\tau)) \mid X(t)) = \theta(\tau - t, X(t)),$$

where

$$\theta(t, x) = E\varphi(X(t) + x).$$

The plan is to deduce the path properties of $X(t)$ from the martingale path theorem by choosing suitable $\varphi(x)$. Take $\varphi(x)$ strictly increasing, $\varphi(+\infty) = \alpha$, $\varphi(-\infty) = -\alpha$, $E|\varphi(X(\tau))| = 1$. By the continuity in probability of $X(t)$, $\theta(t, x)$ is continuous in $t$. It is continuous and strictly increasing in $x$, hence jointly continuous in $x$ and $t$. Thus,

$$y = \theta(\tau - t, x)$$

has an inverse

$$x = \theta^{-1}(t, y)$$

jointly continuous in $(t, y)$, for $0 \le t \le \tau$, $|y| < \alpha$, and

$$X(t) = \theta^{-1}(t, Y(t)).$$

For $T$ dense and countable in $[0, \tau]$, the martingale path theorem implies that $\{Y(t)\}$, $t \in T$, has only jump discontinuities on $[0, \tau]$. Hence, for all paths such that $\sup_{t \in T} |Y(t)| < \alpha$, $\{X(t)\}$, $t \in T$, has only jump discontinuities on $[0, \tau]$. Now, to complete the proof, we need to show that $\sup_{t \in T} |X(t)| < \infty$ a.s., because for every sample function,

$$\sup_{t \in T} |X(t)| = \infty \iff \sup_{t \in T} |Y(t)| = \alpha.$$

Since $|Y(t)|$ is a SMG, apply 5.13 to conclude

$$P\Bigl(\sup_{t \in T} |Y(t)| \ge \frac{\alpha}{2}\Bigr) \le \frac{2}{\alpha}\, E|Y(\tau)|.$$

Since

$$E|Y(\tau)| = E|\varphi(X(\tau))| = 1,$$

we get

$$P\Bigl(\sup_{t \in T} |X(t)| = \infty\Bigr) \le \frac{2}{\alpha}.$$

Since $\alpha$ can be made arbitrarily large, conclude that

$$P\Bigl(\sup_{t \in T} |X(t)| = \infty\Bigr) = 0. \qquad \text{Q.E.D.}$$


In the rest of this chapter we take all processes with stationary, independent
increments to be continuous in probability with sample paths in $D([0, \infty))$.


Problems 


6. Prove, using Skorokhod's lemma 3.21 directly, that

$$\sup_{t \in T} |X(t)| < \infty \quad \text{a.s.}$$

7. For $\{X(t)\}$, $t \ge 0$, a process with stationary, independent increments with sample paths in $D([0, \infty))$, show that the strong Markov property holds; that is, if $t^*$ is any stopping time, then

$$\{X(t + t^*) - X(t^*)\}, \qquad t \ge 0,$$

has the same distribution as $\{X(t)\}$, $t \ge 0$, and is independent of $\mathcal{F}(X(t),\ t \le t^*)$.

8. For $\{X(t)\}$, $t \ge 0$, a process continuous in probability with sample paths in $D([0, \infty))$, show that

$$P(|X(t) - X(t-)| > 0) = 0$$

for all $t \ge 0$.
6. THE POISSON PROCESS


This process stands at the other end of the spectrum from Brownian motion and can be considered as the simplest and most basic of the processes with stationary, independent increments. We get at it this way: A sample point $\omega$ consists of any countable collection of points of $[0, \infty)$ such that if $N(I, \omega)$ is the number of points of $\omega$ falling into the interval $I$, then $N(I, \omega) < \infty$ for all finite intervals $I$. Define $\mathcal{F}$ on $\Omega$ such that all $N(I, \omega)$ are random variables, and impose on the probability $P$ these conditions.


Conditions 14.21 


a) The number of points in nonoverlapping intervals is independent. That is, $I_1, \ldots, I_n$ disjoint $\Rightarrow$ $N(I_1), \ldots, N(I_n)$ independent.

b) The distribution of the number of points in any interval $I$ depends only on the length $\|I\|$ of $I$.


A creature with this type of sample space is called a point process. Con- 
ditions (a) and (b) arise naturally under wide circumstances: For example, 
consider a Geiger counter held in front of a fairly sizeable mass of radioactive 
material, and let the points of w be the successive registration times. Or 
consider a telephone exchange where we plot the times of incoming telephone 
calls over a period short enough so that the disparity between morning, 
afternoon, and nighttime business can be ignored. Define 


$$X(t) = N((0, t]). \tag{14.22}$$

Then by 14.21(a) and (b), $X(t)$ is a process with stationary, independent increments. Now the question is: Which one? Actually, the prior part of this question is: Is there one? The answer is:


Theorem 14.23. A process $X(t)$ with stationary, independent increments has a version with all sample paths constant except for upward jumps of length one if and only if there is a parameter $\lambda \ge 0$ such that

$$Ee^{iuX(t)} = e^{\lambda t(e^{iu} - 1)}.$$

Remark. By expanding, we find that $X(t)$ has the Poisson distribution

$$P(X(t) = n) = \frac{(\lambda t)^n}{n!}\, e^{-\lambda t}$$
and 


sien Same 


Proof. Let $X(t)$ be a process with the given characteristic function, with
paths in $D([0, \infty))$. Then


$$P(X(t) = k) = \frac{(\lambda t)^k}{k!} e^{-\lambda t},$$
that is, $X(t)$ is concentrated on the nonnegative integers $I^+$, so that
$P(X(t) \in I^+, \text{ all } t \in T) = 1$ for any countable set $T$. Taking $T$ dense in
$[0, \infty)$ implies that the paths of the process are integer-valued, with probability
one. Also, $\{X(t)\}$, $t \in T$, has nondecreasing sample paths, because
$\mathcal{L}(X(t + \tau) - X(t)) = \mathcal{L}(X(\tau))$, and $X(\tau) \ge 0$ a.s. Therefore there is a
version of $X(t)$ such that all sample paths take values in $I^+$ and jump upward
only. I want to show that for this version,


$$P\Bigl(\sup_t\, (X(t) - X(t-)) \le 1\Bigr) = 1.$$


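By Problem 8, the probability that $X(t) - X(t-) > 0$ for $t$ rational is
zero. Hence, a.s.,


$$\sup_{0 \le t \le 1} (X(t) - X(t-)) = \lim_n \max_{1 \le k \le n} \bigl(X(k/n) - X((k-1)/n)\bigr),$$
and
$$P\Bigl(\max_{1 \le k \le n} \bigl(X(k/n) - X((k-1)/n)\bigr) \le 1\Bigr) = \bigl[P(X(1/n) \le 1)\bigr]^n = \bigl[e^{-\lambda/n}(1 + \lambda/n)\bigr]^n \to 1 \quad \text{as } n \to \infty.$$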


To go the other way, let $X(t)$ be any process with stationary, independent
increments, integer-valued, such that $X(t) - X(t-) = 0$ or $1$. Take $t_1^*$ to
be the time until the first jump. It is a stopping time. Take $t_2^*$ to be the time
until the first jump of $X(t + t_1^*) - X(t_1^*)$. We know that $t_2^*$ is independent
of $t_1^*$ with the same distribution. The time until the $n$th jump is $t_1^* + \cdots + t_n^*$.


$$P(t_1^* > t + \tau) = P(t_1^* > \tau,\; X(\xi) - X(\tau) = 0,\; \tau \le \xi \le t + \tau).$$
Now $\{t_1^* > \tau\} = \{t_1^* \le \tau\}^c$, hence $\{t_1^* > \tau\}$ is in $\mathcal{F}(X(\xi), \xi \le \tau)$. Therefore
$$P(t_1^* > t + \tau) = P(t_1^* > \tau)\, P(X(\xi) - X(\tau) = 0,\; \tau \le \xi \le t + \tau)$$

$$= P(t_1^* > \tau)\, P(t_1^* > t).$$
This is the exponential equation; the solution is
$$P(t_1^* > t) = e^{-\lambda t}$$
for some $\lambda \ge 0$. To finish, write
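$$P(X(t) = n) = P(t_1^* + \cdots + t_{n+1}^* > t) - P(t_1^* + \cdots + t_n^* > t).$$
Denote $Q_n(t) = P(t_1^* + \cdots + t_n^* > t)$, and note that
$$P(t_1^* + \cdots + t_{n+1}^* > t \mid t_1^* = s) = \begin{cases} Q_n(t - s), & t \ge s, \\ 1, & t < s. \end{cases}$$
So we have
$$Q_{n+1}(t) = e^{-\lambda t} + \lambda e^{-\lambda t} \int_0^t e^{\lambda s} Q_n(s)\, ds, \qquad n \ge 1.$$


This recurrence relation and $Q_1(t) = e^{-\lambda t}$ give


$$Q_{n+1}(t) = e^{-\lambda t}\Bigl(1 + \lambda t + \cdots + \frac{(\lambda t)^n}{n!}\Bigr),$$


leading to
$$P(X(t) = n) = \frac{(\lambda t)^n}{n!} e^{-\lambda t}. \qquad \text{Q.E.D.}$$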
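
The second half of the proof can be checked numerically. The following is only a sketch: the function names are illustrative, not from the text, and the simulation assumes the waiting times between jumps are independent exponential variables with parameter `lam`, as the exponential equation suggests.

```python
import math
import random

def poisson_pmf(n, lam, t):
    """P(X(t) = n) = (lam*t)^n / n! * exp(-lam*t)."""
    return (lam * t) ** n / math.factorial(n) * math.exp(-lam * t)

def q(n, lam, t):
    """Q_n(t) = P(t_1* + ... + t_n* > t) for i.i.d. exponential(lam) waiting times."""
    return math.exp(-lam * t) * sum((lam * t) ** k / math.factorial(k) for k in range(n))

def count_jumps(lam, t, rng):
    """Number of jumps in (0, t] when the waiting times are exponential(lam)."""
    elapsed, jumps = rng.expovariate(lam), 0
    while elapsed <= t:
        jumps += 1
        elapsed += rng.expovariate(lam)
    return jumps

def empirical_pmf(n, lam, t, trials=20000, seed=0):
    """Monte Carlo estimate of P(X(t) = n)."""
    rng = random.Random(seed)
    hits = sum(count_jumps(lam, t, rng) == n for _ in range(trials))
    return hits / trials
```

Here `q(n + 1, lam, t) - q(n, lam, t)` reproduces `poisson_pmf(n, lam, t)` exactly, and `empirical_pmf` should agree with it up to sampling error.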


7. JUMP PROCESSES 


We can use the Poisson processes as building blocks. Let the jump points
of a Poisson process $Y(t)$ with parameter $\lambda > 0$ be $t_1, t_2, \ldots$ Construct
a new process $X(t)$ by assigning jump $X_1$ at time $t_1$, $X_2$ at time $t_2$, $\ldots$, where
$X_1, X_2, \ldots$ are independent, identically distributed random variables with
distribution function $F(x)$, and $\mathcal{F}(X_1, X_2, \ldots)$ is independent of $\mathcal{F}(Y(t), t \ge 0)$. Then


Proposition 14.24. $X(t)$ is a process with stationary, independent increments,
and


$$E \exp[iuX(t)] = \exp\Bigl[\lambda t \int (e^{iux} - 1)\, dF(x)\Bigr].$$


Proof. That $X(t)$ has stationary, independent increments follows from the
construction. Let $h(u) = \int e^{iux}\, dF$. Note that


$$E(\exp[iuX(t)] \mid Y(t) = n) = E \exp[iu(X_1 + \cdots + X_n)] = [h(u)]^n,$$


implying
$$E e^{iuX(t)} = \sum_{n=0}^{\infty} [h(u)]^n \frac{(\lambda t)^n}{n!} e^{-\lambda t} = e^{\lambda t (h(u) - 1)}.$$
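
As a rough illustration of Proposition 14.24, the construction can be simulated and its empirical characteristic function compared with $e^{\lambda t(h(u) - 1)}$. This is a sketch under stated assumptions: the jump law (jumps of $+1$ or $-1$ with equal probability, so that $h(u) = \cos u$) and all function names are chosen here for illustration only.

```python
import cmath
import math
import random

def compound_poisson(lam, t, jump, rng):
    """X(t): sum of the i.i.d. jumps attached to the Poisson jump times in (0, t]."""
    total, elapsed = 0.0, rng.expovariate(lam)
    while elapsed <= t:
        total += jump(rng)
        elapsed += rng.expovariate(lam)
    return total

def exact_cf(u, lam, t, h):
    """exp(lam * t * (h(u) - 1)), with h the characteristic function of F."""
    return cmath.exp(lam * t * (h(u) - 1))

def empirical_cf(u, lam, t, jump, trials=20000, seed=1):
    """Monte Carlo estimate of E exp(i u X(t))."""
    rng = random.Random(seed)
    acc = sum(cmath.exp(1j * u * compound_poisson(lam, t, jump, rng)) for _ in range(trials))
    return acc / trials

def coin_jump(rng):
    """Illustrative jump law F: +1 or -1 with probability 1/2 each."""
    return 1.0 if rng.random() < 0.5 else -1.0

def coin_h(u):
    """Characteristic function of coin_jump: h(u) = cos(u)."""
    return math.cos(u)
```

Calling `empirical_cf(0.7, 2.0, 1.5, coin_jump)` and `exact_cf(0.7, 2.0, 1.5, coin_h)` should give values that differ only by sampling error.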


In Chapter 9, Section 4, it was pointed out that for a generalized Poisson
distribution with jumps of size $x_1, x_2, \ldots$, each jump size contributes an
independent component to the distribution. This is much more graphically
illustrated here. Say that a process has a jump of size $B \in \mathcal{B}_1$ at time $t$ if
$X(t) - X(t-) \in B$. A process with stationary, independent increments and
exponent function


$$\int (e^{iux} - 1)\, \mu(dx)$$


with $\mu(R^{(1)}) < \infty$ is of the type treated above, with $\lambda = \mu(R^{(1)})$,
$F(B) = \mu(B)/\mu(R^{(1)})$. Therefore it has sample paths constant except for jumps. Let
$R^{(1)} - \{0\} = \bigcup_{k=1}^{n} B_k$ for disjoint $B_k \in \mathcal{B}_1$, $k = 1, \ldots, n$. Define processes
$\{X(B_k, t)\}$, $t \ge 0$, by


$$X(B_k, t) = \sum_{\tau \le t} \chi_{B_k}(X(\tau) - X(\tau-)).$$


Thus $X(B_k, t)$ is the sum of the jumps of $X(t)$ of size $B_k$ up to time $t$. We
need to show measurability of $X(B_k, t)$, but this follows easily if we construct
the process $X(t)$ by using a Poisson process $Y(t)$ and independent jump
variables $X_1, X_2, \ldots$, as above. Then, clearly,


$$X(t) = \sum_{k=1}^{n} X(B_k, t).$$
Now we prove:


Proposition 14.25. The processes $\{X(B_k, t)\}$, $t \ge 0$, are independent of each
other, and $\{X(B_k, t)\}$, $t \ge 0$, is a process with stationary, independent increments
and exponent function


$$\int_{B_k} (e^{iux} - 1)\, \mu(dx).$$


Proof. It is possible to prove this directly, but the details are messy. So
we resort to an indirect method of proof. Construct a sample space and
processes $\{X^{(k)}(t)\}$, $t \ge 0$, $k = 1, \ldots, n$, which are independent of each other
and which have stationary, independent increments with exponent functions


$$\int_{B_k} (e^{iux} - 1)\, \mu(dx).$$


Each of these has the same type of distribution as the processes of 14.24,
with $\lambda = \mu(B_k)$, $F_k(B) = \mu(B \cap B_k)/\mu(B_k)$. Hence $F_k(dx)$ is concentrated
on $B_k$ and $\{X^{(k)}(t)\}$, $t \ge 0$, has jumps only of size $B_k$. Let


$$\tilde{X}(t) = \sum_{k=1}^{n} X^{(k)}(t).$$


Then $\{\tilde{X}(t)\}$, $t \ge 0$, has stationary, independent increments and the same
characteristic function for every $t$ as $X(t)$. Therefore $\{\tilde{X}(t)\}$ and $\{X(t)\}$ have
the same distribution. But by construction,


$$X^{(k)}(t) = \sum_{\tau \le t} \chi_{B_k}(\tilde{X}(\tau) - \tilde{X}(\tau-)),$$


therefore $\{X^{(k)}(t)\}$, $t \ge 0$, is defined on the $\{\tilde{X}(t)\}$ process by the same function
as $\{X(B_k, t)\}$ on the $\{X(t)\}$ process. Hence the processes $\{X(B_k, t)\}$,
$\{X(t)\}$ have the same joint distribution as $\{X^{(k)}(t)\}$, $\{\tilde{X}(t)\}$, proving the
proposition.
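
The decomposition can be made concrete with a small simulation. The sketch below assumes a hypothetical jump measure $\mu = \lambda F$ with $F$ uniform on $\{-2, -1, 1, 2\}$ and three disjoint sets $B_1 = (-\infty, 0)$, $B_2 = (0, 1]$, $B_3 = (1, \infty)$. The names are illustrative, and `split_by_size` reads $X(B_k, t)$ as the sum of the jump sizes that fall in $B_k$, following the verbal description in the text.

```python
import random

def jumps_up_to(t, lam, jump, rng):
    """Jump sizes of a compound Poisson path on (0, t] (rate lam, sizes drawn by jump)."""
    sizes, elapsed = [], rng.expovariate(lam)
    while elapsed <= t:
        sizes.append(jump(rng))
        elapsed += rng.expovariate(lam)
    return sizes

def split_by_size(sizes, sets):
    """X(B_k, t) for each B_k: the sum of those jumps whose size lies in B_k."""
    return [sum(x for x in sizes if in_set(x)) for in_set in sets]

# Hypothetical partition of R - {0} into three disjoint sets.
SETS = [lambda x: x < 0, lambda x: 0 < x <= 1, lambda x: x > 1]

def uniform_jump(rng):
    """Illustrative jump law F: uniform on {-2, -1, 1, 2}."""
    return rng.choice([-2.0, -1.0, 1.0, 2.0])

def mean_jump_counts(t, lam, trials=20000, seed=2):
    """Average number of jumps falling in each B_k; should be close to t * mu(B_k)."""
    rng = random.Random(seed)
    totals = [0] * len(SETS)
    for _ in range(trials):
        sizes = jumps_up_to(t, lam, uniform_jump, rng)
        for k, in_set in enumerate(SETS):
            totals[k] += sum(1 for x in sizes if in_set(x))
    return [c / trials for c in totals]
```

For any simulated path the pieces returned by `split_by_size` add back up to $X(t)$, and with `lam = 4`, `t = 1` the average jump counts should be near $\mu(B_1) = 2$, $\mu(B_2) = 1$, $\mu(B_3) = 1$.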
8. LIMITS OF JUMP PROCESSES


The results of the last section give some insight into the description of
the paths of the general process. First, let $\{X(t)\}$, $t \ge 0$, be a process with
stationary, independent increments, whose exponent function $\psi$ has only an
integral component:


$$\psi(u) = \int \Bigl(e^{iux} - 1 - \frac{iux}{1 + x^2}\Bigr) \mu(dx).$$


If $\mu(R^{(1)}) < \infty$, then the process is of the form $\{\tilde{X}(t) - \beta t\}$, where
$\beta = \int \frac{x}{1 + x^2}\, \mu(dx)$ and $\{\tilde{X}(t)\}$,
$t \ge 0$, is of the type studied in the previous section, with sample paths
constant except for isolated jumps. In the general case, $\mu$ assigns infinite
mass to arbitrarily small neighborhoods of the origin. This leads to the
suspicion that the paths for these processes have an infinite number of
jumps of very small size. To better understand this, let $D$ be any neighborhood
of the origin, $\{X(D^c, t)\}$, $t \ge 0$, be the process of jumps of size greater
than $D$, where we again define: the process $\{X(B, t)\}$, $t \ge 0$, of jumps of
$\{X(t)\}$ of size $B \in \mathcal{B}_1$, $\{0\} \not\subset B$, is given by


(14.26) $X(B, t) = \sum_{\tau \le t} \chi_B(X(\tau) - X(\tau-))$.


Assuming that the results analogous to the last section carry over,
$\{X(D^c, t)\}$ has exponent function


$$\int_{D^c} (e^{iux} - 1)\, \mu(dx).$$


Letting


$$\beta_D = \int_{D^c} \frac{x}{1 + x^2}\, \mu(dx),$$


we have that the exponent function of $X(D^c, t) - \beta_D t$ is given by


$$\psi_D(u) = \int_{D^c} \Bigl(e^{iux} - 1 - \frac{iux}{1 + x^2}\Bigr) \mu(dx).$$


Therefore, as $D$ shrinks down to $\{0\}$, $\psi_D(u) \to \psi(u)$, and in some sense we
expect that $X(t)$ is the limit of $X(D^c, t) - \beta_D t$. In fact, we can get a very
strong convergence.


Theorem 14.27. Let $\{X(t)\}$, $t \ge 0$, be a process with stationary, independent
increments having only an integral component in its characteristic function.
Take $\{D_n\}$ neighborhoods of the origin such that $D_n \downarrow \{0\}$; then $\{X(D_n^c, t)\}$,
$t \ge 0$, is a process with stationary, independent increments and exponent
function


$$\int_{D_n^c} (e^{iux} - 1)\, \mu(dx).$$
Put


$$\beta_n = \int_{D_n^c} \frac{x}{1 + x^2}\, \mu(dx).$$


For any $t_0 < \infty$,
$$\sup_{0 \le t \le t_0} |X(t) - X(D_n^c, t) + \beta_n t| \xrightarrow{\text{a.s.}} 0$$

as $n \to \infty$.

Proof. Take $D_0 = R^{(1)}$, $B_n = D_{n+1}^c - D_n^c$. Construct a probability space on
which there are processes $\{Z_n(t)\}$, $t \ge 0$, with stationary, independent
increments such that the processes are independent of each other, with
exponent functions


$$\psi_n(u) = \int_{B_n} (e^{iux} - 1)\, \mu(dx).$$


Then 14.25 implies that the paths of $Z_n(t)$ are constant except for jumps of
size $B_n$. Denote


$$X_n(t) = \sum_{k=0}^{n-1} Z_k(t) \quad \text{and} \quad b_k = \int_{B_k} \frac{x}{1 + x^2}\, \mu(dx).$$


Consequently, $X_n(t) - \beta_n t$ is the sum of the independent components
$Z_k(t) - b_k t$, $k < n$. Since the characteristic functions converge, for every $t$,
$X_n(t) - \beta_n t$ converges in distribution.
Because we are dealing with sums of independent random variables, 8.36
implies that for every $t$ there is a random variable $\tilde{X}(t)$ such that


$$X_n(t) - \beta_n t \xrightarrow{\text{a.s.}} \tilde{X}(t).$$


This implies that $\{\tilde{X}(t)\}$, $t \ge 0$, is a process with stationary, independent
increments having the same distribution as $\{X(t)\}$, $t \ge 0$. We may assume
that $\{\tilde{X}(t)\}$, $t \ge 0$, has all its sample paths in $D([0, \infty))$. Take, for example,
$t_0 = 1$, $T$ any set dense in $[0, 1]$. For $\tilde{Y}_n(t) = \tilde{X}(t) - X_n(t) + \beta_n t$,


$$\sup_{0 \le t \le 1} |\tilde{Y}_n(t)| = \sup_{t \in T} |\tilde{Y}_n(t)|.$$


Let $T_N$ be finite subsets of $T$, $T_N \uparrow T$. Then $\sup_{t \in T} |\tilde{Y}_n(t)| = \lim_N \sup_{t \in T_N} |\tilde{Y}_n(t)|$.


Because $\{\tilde{Y}_n(t)\}$ is a process with stationary, independent increments we can
apply Skorokhod's lemma:


$$P\Bigl(\sup_{t \in T_N} |\tilde{Y}_n(t)| > 2y\Bigr) \le \frac{1}{1 - C_N}\, P(|\tilde{Y}_n(1)| > y),$$
where
$$C_N \le \sup_{0 \le t \le 1} P(|\tilde{Y}_n(t)| > y).$$
Let $\tilde{\psi}_n(u)$ be the exponent function of $\tilde{Y}_n(1)$. Apply inequality 8.29 to
write:


$$P(|\tilde{Y}_n(t)| > y) \le \alpha y \int_0^{y^{-1}} |1 - e^{t \tilde{\psi}_n(u)}|\, du.$$


For $|e^z| \le 1$, there is a constant $\gamma < \infty$ such that $|e^z - 1| \le \gamma |z|$. Hence,


$$P(|\tilde{Y}_n(t)| > y) \le \alpha \gamma y \int_0^{y^{-1}} |\tilde{\psi}_n(u)|\, du.$$
Since $\tilde{\psi}_n(u) \to 0$, then $C_N \to 0$, leading to
$$\sup_{0 \le t \le 1} |\tilde{Y}_n(t)| \xrightarrow{P} 0.$$


Thus we can find a subsequence $\{n'\}$ such that a.s.
$$X_{n'}(t) - \beta_{n'} t \to \tilde{X}(t) \quad \text{uniformly on } [0, 1].$$


Some reflection on this convergence allows the conclusion that the process
$X_n(t)$ can be identified with $\tilde{X}(D_n^c, t)$. Therefore $\{\tilde{X}(t)\}$ and $\{X_n(t)\}$ have the
same joint distribution as $\{X(t)\}$, $\{X(D_n^c, t)\}$. Using this fact proves the theorem.


If a function $x(t) \in D([0, \infty))$ is, at all times $t$, the sum of all its jumps
up to time $t$, then it is called a pure jump function. However, this is not
well defined if the sum of the lengths of the jumps is infinite. Then $x(t)$ is
the sum of positive and negative jumps which to some extent cancel each
other out to produce $x(t)$, and the order of summation of the jumps up to
time $t$ becomes important. We define:


Definition 14.28. If there is a sequence of neighborhoods $D_n \downarrow \{0\}$ such that
$$\sum_{\tau \le t} \chi_{D_n^c}(x(\tau) - x(\tau-)) \to x(t),$$


then $x(t) \in D([0, \infty))$ is called a pure jump function.


Many interesting processes of the type studied in this section have the property
that there is a sequence of neighborhoods $D_n \downarrow \{0\}$ such that


$$\beta = \lim_n \int_{D_n^c} \frac{x}{1 + x^2}\, \mu(dx)$$
exists and is finite. Under this assumption we have


Corollary 14.29. Almost every sample path of $\{X(t) + \beta t\}$, $t \ge 0$, is a pure
jump function.


Another interesting consequence of this construction is the following:
Take neighborhoods $D_n \downarrow \{0\}$ and take $\{X(D_n^c, t)\}$, as above, the process of
jumps of size greater than $D_n$. Restrict the processes to a finite time interval
$[0, t_0]$ and consider any event $A$ such that for every $n$,


$$A \in \mathcal{F}(X(t) - X(D_n^c, t),\; 0 \le t \le t_0).$$
Then


Proposition 14.30. $P(A)$ is zero or one.


Proof. Let $B_n = D_{n+1}^c - D_n^c$; then the processes $\{X(B_n, t)\}$, $t \in [0, t_0]$, are
independent and $A$ is measurable


$$\mathcal{F}(\{X(B_n, t)\}, \{X(B_{n+1}, t)\}, \ldots;\; t \in [0, t_0])$$


for every $n$. Apply a slight generalization of the Kolmogorov zero-one law
to conclude $P(A) = 0$ or $1$.

The results of this section make it easier to understand why infinitely
divisible laws were developed to use in the context of processes with independent
increments earlier than in the central limit problem. The processes
of jumps of different sizes proceed independently of one another, and the
jump process of jumps of size $[x, x + \Delta x)$ contributes a Poisson component
with exponent function approximately equal to


$$(e^{iux} - 1)\, \mu([x, x + \Delta x)).$$


The fact that the measure $\mu$ governs the number and the size of jumps
is further exposed in the following problems, all referring to a process
$\{X(t)\}$, $t \ge 0$, with stationary, independent increments and exponent function
having only an integral component.


Problems 


9. Show that for any set $B \in \mathcal{B}_1$ bounded away from the origin, the process
of jumps of size $B$, $\{X(B, t)\}$, $t \ge 0$, has stationary, independent increments
with exponent function


$$\int_B (e^{iux} - 1)\, \mu(dx),$$


and the processes $\{X(t) - X(B, t)\}$, $\{X(B, t)\}$ are independent.
10. For $B$ as above, show that the expected number of jumps of size $B$
in $0 \le t \le 1$ is $\mu(B)$.
11. For $B$ as above let $t^*$ be the time of the first jump of size $B$. For $C \subset B$,
$C \in \mathcal{B}_1$, prove that

$$P(X(t^*) - X(t^*-) \in C) = \mu(C)/\mu(B).$$


12. Show that except for a set of probability zero, either all sample functions
of $\{X(t)\}$, $t \in [0, t_0]$, have infinite variation or all have finite variation.
13. Show that for $\{X(t)\}$, $t \ge 0$, as above, all sample paths have finite variation
on every finite time interval $[0, t_0]$ if and only if


$$\int_{|x| < 1} |x|\, \mu(dx) < \infty.$$


[Take $t_0 = 1$. The function


$$V_n = \sum_{k=1}^{2^n} |X(k/2^n) - X((k-1)/2^n)|$$
is monotonically nondecreasing. Now compute $E e^{-\lambda V_n}$ for $\lambda > 0$.]


9. EXAMPLES 


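Consider the first passage time $t^*_\xi$ of a Brownian motion $X(t)$ to the point $\xi$.
Denote $Z(\xi) = t^*_\xi$. For $\xi_1, \xi_2 > 0$,


$$t^*_{\xi_1 + \xi_2} - t^*_{\xi_1} = \inf\{t;\; X(t + t^*_{\xi_1}) - X(t^*_{\xi_1}) = \xi_2\}.$$
By the strong Markov property, $Z(\xi_1 + \xi_2) - Z(\xi_1)$ is independent of
$\mathcal{F}(Z(\xi), \xi \le \xi_1)$ and is distributed as $Z(\xi_2)$. Thus, to completely characterize
$Z(\xi)$, all we need is its characteristic function. From Chapter 13,


$$E e^{iuZ(\xi)} = e^{\xi \psi(u)}, \qquad \psi(u) = \begin{cases} -\sqrt{2|u|}\, e^{-i\pi/4}, & u \ge 0, \\ -\sqrt{2|u|}\, e^{i\pi/4}, & u < 0. \end{cases}$$


This is the characteristic function of a stable distribution with exponent $\tfrac{1}{2}$.
The jump measure $\mu(dx)$ is given by


$$\mu(dx) = \begin{cases} 0, & x < 0, \\ c\, \dfrac{dx}{x^{3/2}}, & x > 0. \end{cases}$$


Doing some definite integrals gives $c = 1/\sqrt{2\pi}$.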
If the characteristic function of a process with stationary, independent
increments is stable with exponent $\alpha$, call the process stable with exponent
$\alpha$. For $0 < \alpha < 1$,
$$\int_0^\infty \frac{x}{1 + x^2}\, \frac{dx}{x^{1+\alpha}}$$


exists. The processes with exponent functions


$$\psi(u) = m \int_0^\infty (e^{iux} - 1)\, \frac{dx}{x^{1+\alpha}}$$
are the limit of processes $\{X_n(t)\}$ with exponent functions


(14.31) $\psi_n(u) = m \int_{1/n}^\infty (e^{iux} - 1)\, \dfrac{dx}{x^{1+\alpha}}$.

These latter processes have only upward jumps. Hence all paths of $\{X(t)\}$
are nondecreasing pure jump functions.
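
The approximating processes $X_n(t)$ of (14.31) are easy to simulate, since each is a jump process of the kind in Section 7. The following sketch assumes the jump intensity $\lambda_n = m \int_{1/n}^\infty x^{-1-\alpha}\, dx = m n^\alpha / \alpha$ and draws jump sizes by inverting the tail $P(\text{jump} > x) = (nx)^{-\alpha}$, $x \ge 1/n$. The function names and parameters are illustrative, not from the text.

```python
import random

def intensity(m, alpha, n):
    """lambda_n = m * integral_{1/n}^inf x^(-1-alpha) dx = m * n**alpha / alpha."""
    return m * n ** alpha / alpha

def sample_jump(alpha, n, rng):
    """Jump size with density proportional to x^(-1-alpha) on [1/n, inf) (inverse CDF)."""
    u = 1.0 - rng.random()          # uniform on (0, 1], avoids division by zero
    return (1.0 / n) * u ** (-1.0 / alpha)

def truncated_path(m, alpha, n, t_max, rng):
    """Jump times and running values of X_n on (0, t_max]."""
    lam = intensity(m, alpha, n)
    times, values = [], []
    elapsed, level = rng.expovariate(lam), 0.0
    while elapsed <= t_max:
        level += sample_jump(alpha, n, rng)
        times.append(elapsed)
        values.append(level)
        elapsed += rng.expovariate(lam)
    return times, values
```

Every simulated path is nondecreasing with all jumps at least $1/n$. Increasing `n` adds more and more small jumps while leaving the law of the large ones unchanged.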
Stable processes of exponent $\ge 1$ having nondecreasing sample paths do
not exist. If $\mathcal{L}(X(t)) = \mathcal{L}(-X(t))$, the process is symmetric. Bochner [11]
noted that it was possible to construct the symmetric stable processes from
Brownian motion by a random time change. Take a sample space with a
normalized Brownian motion $X(t)$ and a stable process $Z(t)$ defined on it
such that $Z(t)$ has nondecreasing pure jump sample paths and $\mathcal{F}(X(t), t \ge 0)$,
$\mathcal{F}(Z(t), t \ge 0)$ are independent.


Theorem 14.32. If $Z(t)$ has exponent $\alpha$, $0 < \alpha < 1$, then the process
$Y(t) = X(Z(t))$ is a stable symmetric process of exponent $2\alpha$.


Proof. The idea can be seen if we write
$$Y(t + \tau) - Y(t) = X((Z(t + \tau) - Z(t)) + Z(t)) - X(Z(t)).$$


Then, given $Z(t)$, the process $Y(t + \tau) - Y(t)$ looks just as if it were the
$Y(\tau)$ process, independent of $Y(s)$, $s \le t$.

For a formal proof, take $Z_n(t)$ the process of jumps of $Z(t)$ larger than
$[0, 1/n)$. Its exponent function is


$$\psi_n(u) = m \int_{1/n}^\infty (e^{iux} - 1)\, \frac{dx}{x^{1+\alpha}}.$$


The jumps of $Z_n(t)$ occur at the jump times of a Poisson process with intensity


$$\lambda_n = m \int_{1/n}^\infty \frac{dx}{x^{1+\alpha}},$$


and the jumps have magnitude $Y_1, Y_2, \ldots$ independent of one another and
of the jump times, and are identically distributed. Thus $X(Z_n(t))$ has jumps
only at the jump times of the Poisson process. The size of the $k$th jump is


$$U_k = \begin{cases} X(Y_1 + \cdots + Y_k) - X(Y_1 + \cdots + Y_{k-1}), & k > 1, \\ X(Y_1), & k = 1. \end{cases}$$


By an argument almost exactly the same as the proof of the strong Markov
property for Brownian motion, $U_k$ is independent of $U_{k-1}, \ldots, U_1$ and has the
same distribution as $U_1$. Therefore $X(Z_n(t))$ is a process with stationary, independent
increments. Take $\{n'\}$ so that $Z_{n'}(t) \to Z(t)$ a.s. Use continuity of
Brownian motion to get $X(Z_{n'}(t)) \to X(Z(t))$ for every $t$. Thus $X(Z(t))$ is
a process with stationary, independent increments. To get its characteristic
function, write
$$E(e^{iuX(Z(t))} \mid Z(t) = z) = E e^{iuX(z)} = e^{-u^2 z/2}.$$


Therefore
$$E e^{iuX(Z(t))} = E e^{-(u^2/2) Z(t)} = e^{-ct|u|^{2\alpha}}.$$
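
A special case can be tested by simulation: take $Z$ to be the first passage process of Section 9 (exponent $\tfrac{1}{2}$), so that $Y(t) = X(Z(t))$ should be symmetric stable of exponent 1, that is, Cauchy. The sketch below relies on two assumptions not stated in the text: the standard identity that the first passage time to level $\xi$ is distributed as $\xi^2 / N^2$ with $N$ standard normal, and that under this normalization $Y(t)$ has Cauchy scale $t$, so $P(|Y(t)| \le t) = \tfrac{1}{2}$. Function names are illustrative.

```python
import random

def first_passage_time(level, rng):
    """Draw Z(level) via the assumed identity Z(level) ~ level**2 / N**2, N standard normal."""
    n = rng.gauss(0.0, 1.0)
    while n == 0.0:                 # guard against an exact zero draw
        n = rng.gauss(0.0, 1.0)
    return (level / n) ** 2

def subordinated_value(t, rng):
    """Y(t) = X(Z(t)): given Z(t) = z, X(z) is normal with mean 0 and variance z."""
    z = first_passage_time(t, rng)
    return rng.gauss(0.0, 1.0) * z ** 0.5

def fraction_within(t, trials=40000, seed=3):
    """Empirical P(|Y(t)| <= t); equals 1/2 for a symmetric Cauchy law with scale t."""
    rng = random.Random(seed)
    return sum(abs(subordinated_value(t, rng)) <= t for _ in range(trials)) / trials
```

The value of `fraction_within(t)` should be close to 0.5 for any `t > 0`, which is consistent with the scaling $e^{-ct|u|^{2\alpha}}$ at $\alpha = \tfrac{1}{2}$.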
10. A REMARK ON A GENERAL DECOMPOSITION 


Suppose $\{X(t)\}$, $t \ge 0$, is a process with stationary, independent increments
and exponent function


$$\psi(u) = iu\beta - \frac{\sigma^2 u^2}{2} + \int \Bigl(e^{iux} - 1 - \frac{iux}{1 + x^2}\Bigr) \mu(dx).$$
Since $iu\beta - \sigma^2 u^2/2$ is the exponent function for a Brownian motion, a
natural expectation is that $X(t) = X^{(1)}(t) + X^{(2)}(t)$, where $\{X^{(1)}(t)\}$, $t \ge 0$,
is a Brownian motion with drift $\beta$ and variance $\sigma^2$, and $\{X^{(2)}(t)\}$, $t \ge 0$, is
a process with stationary, independent increments with exponent function
having only an integral component, and that the two processes are independent.
This is true, in fact, and can be proved by the methods of Sections 7
and 8. But as processes with stationary, independent increments appear in
practice either as a Brownian motion, or a process with no Brownian component,
we neglect the proof of this decomposition.




NOTES 


For more material on processes with stationary, independent increments, see
Doob [39, Chap. 8] and Paul Lévy's two books [103 and 105]. These latter
two are particularly good on giving an intuitive feeling for what these processes
look like. Of course, for continuous parameter martingales, the best
source is Doob's book. The sample path properties of a continuous parameter
martingale were given by Doob in 1951 [38], and applied to processes
with independent increments.

Processes with independent increments had been introduced by de
Finetti in 1929. Their sample path properties were studied by Lévy in 1934
[102]. He then proved Theorem 14.20 as generalized to processes with
independent increments, not necessarily stationary. Most of the subsequent
decomposition and building up from Poisson processes follows Lévy also,
in particular [103, p. 93]. The article by Ito [75] makes this superposition
idea more precise by defining an integral over Poisson processes.


CHAPTER 15 


MARKOV PROCESSES, 
INTRODUCTION AND PURE JUMP CASE 


1. INTRODUCTION AND DEFINITIONS 
Markov processes in continuous time are, as far as definitions go, a 
straightforward extension of the Markov dependence idea. 


Definition 15.1. A process $\{X(t)\}$, $t \ge 0$, is called Markov with state space
$F \in \mathcal{B}_1$ if $X(t) \in F$, $t \ge 0$, and for any $B \in \mathcal{B}_1(F)$, $t, \tau \ge 0$,


(15.2) $P(X(t + \tau) \in B \mid X(s), s \le t) = P(X(t + \tau) \in B \mid X(t))$ a.s.


To verify that a process is Markov, all we need is to have for any
$t_{n+1} > t_n > \cdots > t_1 \ge 0$,
$$P(X(t_{n+1}) \in B \mid X(t_n), \ldots, X(t_1)) = P(X(t_{n+1}) \in B \mid X(t_n)) \quad \text{a.s.}$$


Since finite dimensional sets determine $\mathcal{F}(X(s), s \le t)$, this extends to (15.2).

The Markov property is a statement about the conditional probability
at the one instant $t + \tau$ in the future. But it extends to a general statement
about the future, given the present and past:


Proposition 15.3. If $\{X(t)\}$ is Markov, then for $A \in \mathcal{F}(X(\tau), \tau \ge t)$,
$$P(A \mid X(s), s \le t) = P(A \mid X(t)) \quad \text{a.s.}$$


The proof of this is left to the reader. 


Definition 15.4. By Theorem 4.30, for every $t_2 > t_1$, a version $p_{t_2,t_1}(B \mid x)$ of
$P(X(t_2) \in B \mid X(t_1) = x)$ can be selected such that $p_{t_2,t_1}(B \mid x)$ is a probability
on $\mathcal{B}_1(F)$ for $x$ fixed, and $\mathcal{B}_1(F)$ measurable in $x$ for $B$ fixed. Call these a set of
transition probabilities for the process.


Definition 15.5. $\pi(\cdot)$ on $\mathcal{B}_1(F)$ given by $\pi(B) = P(X(0) \in B)$ is the initial
distribution for the process.

The importance of transition probabilities is that the distribution of the
process is completely determined by them if the initial distribution is specified.
This follows from


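Proposition 15.6. For $B_0, \ldots, B_n \in \mathcal{B}_1(F)$, $t_n > t_{n-1} > \cdots > t_1 > t_0 = 0$,


(15.7)
$$P(X(t_n) \in B_n, \ldots, X(t_1) \in B_1, X(t_0) \in B_0) = \int_{B_n} \cdots \int_{B_1} \int_{B_0} p_{t_n,t_{n-1}}(dx_n \mid x_{n-1}) \cdots p_{t_1,t_0}(dx_1 \mid x_0)\, \pi(dx_0).$$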


The proof is the same as for 7.3. 

A special case of (15.7) are the Chapman-Kolmogorov equations.
Reason this way: To get from $x$ at time $\tau$ to $B$ at time $t > \tau$, fix any intermediate
time $s$, $t > s > \tau$. Then the probability of all the paths that go
from $x$ to $B$ through the small neighborhood $dy$ of $y$ at time $s$ is given
(approximately) by $p_{t,s}(B \mid y)\, p_{s,\tau}(dy \mid x)$. So summing over $dy$ gets us to the
Chapman-Kolmogorov equations,


(15.8) $p_{t,\tau}(B \mid x) = \int p_{t,s}(B \mid y)\, p_{s,\tau}(dy \mid x)$.


Actually, what is true is
Proposition 15.9. The equation (15.8) holds a.s. with respect to $P(X(\tau) \in dx)$.
Proof. From (15.7), for every $C \in \mathcal{B}_1(F)$,


$$P(X(t) \in B, X(\tau) \in C) = \int_C \Bigl[\int p_{t,s}(B \mid y)\, p_{s,\tau}(dy \mid x)\Bigr] P(X(\tau) \in dx).$$


Taking the Radon derivative with respect to $P(X(\tau) \in dx)$ gives the result.
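
For a concrete instance of (15.8), consider the Poisson process of Chapter 14, whose transition probabilities are stationary, so that $p_{t,\tau}$ depends only on $t - \tau$. The sketch below takes $\tau = 0$ and checks the equation on the integer state space, where the integral over $dy$ becomes a finite sum. The function names and the rate `lam` are illustrative assumptions.

```python
import math

def p(t, j, i, lam=1.3):
    """p_t({j} | i) for a Poisson process: probability of j - i jumps in time t."""
    k = j - i
    if k < 0:
        return 0.0
    return math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)

def chapman_kolmogorov_rhs(t, s, j, i, lam=1.3):
    """Sum over the intermediate state y of p_{t-s}({j} | y) * p_s({y} | i)."""
    return sum(p(t - s, j, y, lam) * p(s, y, i, lam) for y in range(i, j + 1))
```

For any `0 < s < t`, `chapman_kolmogorov_rhs(t, s, j, i)` agrees with `p(t, j, i)` up to rounding, which is the binomial theorem in disguise.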


One rarely starts with a process on a sample space $(\Omega, \mathcal{F}, P)$. Instead,
consistent distribution functions are specified, and then the process constructed.
For a Markov process, what gets specified are the transition
probabilities and the initial distribution. Here there is a divergence from
the discrete time situation in which the one-step transition probabilities
$P(X_{n+1} \in B \mid X_n = x)$ determine all the multiple-step probabilities. There are
no corresponding one-step probabilities here; the probabilities $\{p_{t,\tau}(B \mid x)\}$,
$t > \tau \ge 0$, must all be specified, and they must satisfy among themselves
at least the functional relationship (15.8).


2. REGULAR TRANSITION PROBABILITIES 
Definition 15.10. A set of functions $\{p_{t,\tau}(B \mid x)\}$ defined for all $t > \tau \ge 0$,
$B \in \mathcal{B}_1(F)$, $x \in F$ is called a regular set of transition probabilities if


1) $p_{t,\tau}(B \mid x)$ is a probability on $\mathcal{B}_1(F)$ for $x$ fixed and $\mathcal{B}_1(F)$ measurable in $x$
for $B$ fixed,
2) for every $B \in \mathcal{B}_1(F)$, $x \in F$, $t > s > \tau \ge 0$,


$$p_{t,\tau}(B \mid x) = \int p_{t,s}(B \mid y)\, p_{s,\tau}(dy \mid x).$$


To be regular, then, a set of transition probabilities must satisfy the 
Chapman-Kolmogorov equations identically. 


Theorem 15.11. Given a regular set of transition probabilities and a distribution
$\pi(dx_0)$, define probabilities on cylinder sets by: for $t_n > t_{n-1} > \cdots > t_1 > t_0 = 0$,
$B_0, \ldots, B_n \in \mathcal{B}_1(F)$,


(15.12)
$$P(X(t_n) \in B_n, \ldots, X(t_0) \in B_0) = \int_{B_n} \cdots \int_{B_0} p_{t_n,t_{n-1}}(dx_n \mid x_{n-1}) \cdots p_{t_1,t_0}(dx_1 \mid x_0)\, \pi(dx_0).$$


These are consistent and the resultant process $\{X(t)\}$ is Markov with the given
functions $p_{t,\tau}$, $\pi$ as transition probabilities and initial distribution.


Proof. The first verification is consistency. Let $B_k = F$. The expression
in $P(X(t_n) \in B_n, \ldots)$ that involves $x_k$, $t_k$ is an integration with respect to
the probability defined for $B \in \mathcal{B}_1(F)$ by


$$\int p_{t_{k+1},t_k}(B \mid x_k)\, p_{t_k,t_{k-1}}(dx_k \mid x_{k-1}).$$
By the Chapman-Kolmogorov equations, this is exactly


$$p_{t_{k+1},t_{k-1}}(B \mid x_{k-1}),$$


which eliminates $t_k$, $x_k$ and gives consistency. Thus, a bit surprisingly, the
functional relations (15.8) are the key to consistency. Now extend and
get a process $\{X(t)\}$ on $(F^{[0,\infty)}, \mathcal{B}_1^{[0,\infty)}(F))$. To verify the remainder, let
$A \in \mathcal{B}_n(F)$. By extension from the definition,


$$P(X(t_n) \in B, (X(t_{n-1}), \ldots, X(t_0)) \in A) = \int_{\{(x_{n-1}, \ldots, x_0) \in A\}} p_{t_n,t_{n-1}}(B \mid x_{n-1})\, P(X(t_{n-1}) \in dx_{n-1}, \ldots, X(t_0) \in dx_0).$$


Now, take the Radon derivative to get the result.

One result of this theorem is that there are no functional relationships 
other than those following from the Chapman-Kolmogorov equations that 
transition probabilities must satisfy in general. 

Another convention we make starting from a regular set of transition
probabilities is this: Let $P_{x,\tau}(\cdot)$ be the probability on the space of paths
$F^{[0,\infty)}$, gotten by “starting the process out at the point $x$ at time $\tau$.” More
specifically, let $P_{x,\tau}(\cdot)$ be the probability on $\mathcal{B}_1^{[0,\infty)}(F)$ extended from (15.12),
where $t_0 = \tau$ and $\pi(dx_0)$ concentrated on the point $\{x\}$ are used. So $P_{x,\tau}(\cdot)$
is well-defined for all $x$, $\tau$ in terms of the transition probabilities only.
Convention. For $C \in \mathcal{F}(X(s), s \le \tau)$, $A \in \mathcal{B}_1^{[0,\infty)}(F)$ always use the version of
$P(X(\tau + \cdot) \in A \mid X(\tau) = x, C)$ given by $P_{x,\tau}(A)$.


Accordingly, we use the transition probabilities not only to get the distribution
of the process but also to manufacture versions of all important
conditional probabilities. The point of requiring the Chapman-Kolmogorov
equations to hold identically rather than a.s. is that if there is an $x$, $B$,
$t > s > \tau \ge 0$, such that


$$p_{t,\tau}(B \mid x) \ne \int p_{t,s}(B \mid y)\, p_{s,\tau}(dy \mid x),$$


then these transition probabilities cannot be used to construct “the process
starting from $x$, $\tau$.”

Now, because we wish to study the nature of a Markov process as 
governed by its transition probabilities, no matter what the initial distribu- 
tion, we enlarge our nomenclature. Throughout this and the next chapter 
when we refer to a Markov process {X(t)} this will no longer refer to a single 
process. Instead, it will denote the totality of processes having the same 
transition probabilities but with all possible different initial starting points x 
at time zero. However, we will use only coordinate representation processes, 
so the measurability of various functions and sets will not depend on the 
choice of a starting point or initial distribution for the process. 


3. STATIONARY TRANSITION PROBABILITIES 


Definition 15.13. Let $\{p_{t,\tau}\}$ be a regular set of transition probabilities. They
are called stationary if for all $t > \tau \ge 0$,


$$p_{t,\tau}(B \mid x) = p_{t-\tau,0}(B \mid x).$$


In this case, the $p_t(B \mid x) = p_{t,0}(B \mid x)$ are referred to as the transition
probabilities for the process.


Some simplification results when the transition probabilities are stationary.
The Chapman-Kolmogorov equations become


(15.14) $p_{t+\tau}(B \mid x) = \int p_t(B \mid y)\, p_\tau(dy \mid x)$.


For any $A \in \mathcal{B}_1^{[0,\infty)}(F)$, $P_{x,\tau}(A) = P_{x,0}(A)$; the probabilities on the path
space of the process are the same no matter when the process is started.
Denote for any $A \in \mathcal{B}_1^{[0,\infty)}(F)$, $f(\cdot)$ on $F^{[0,\infty)}$ measurable $\mathcal{B}_1^{[0,\infty)}(F)$,


(15.15) $P_x(A) = P_{x,0}(A)$,


$$E_x f(\cdot) = \int f(\omega)\, P_x(d\omega).$$
Assume also from now on that for any initial distribution, $X(t) \xrightarrow{P} X(0)$ as
$t \to 0$. This is equivalent to the statement that the transition probabilities
are standard in the following sense.


Definition 15.16. Transition probabilities $p_t(B \mid x)$ are called standard if


$$p_t(\cdot \mid x) \xrightarrow{\mathcal{D}} \delta_{\{x\}}(\cdot)$$
for all $x \in F$, where $\delta_{\{x\}}(\cdot)$ denotes the distribution with unit mass on $\{x\}$.


There is another property that will be important in the sequel. Suppose
we have a stopping time $t^*$ for the process $X(t)$. The analog of the restarting
property of Brownian motion is that the process $X(t + t^*)$, given everything
that happened up to time $t^*$, has the same distribution as the process $X(t)$
starting from the point $X(t^*)$.


Definition 15.17. A Markov process $\{X(t)\}$ with stationary transition probabilities
is called strong Markov if for every stopping time $t^*$ [see (12.40)],
every starting point $x \in F$, and set $A \in \mathcal{B}_1^{[0,\infty)}(F)$,


$$P_x(X(\cdot + t^*) \in A \mid X(s), s \le t^*) = P_{X(t^*)}(X(\cdot) \in A) \quad \text{a.s. } P_x.$$


Henceforth, call a stopping time for a Markov process a Markov time.
It's fairly clear from the definitions that for fixed $\tau \ge 0$,


(15.18) $P_x(X(\cdot + \tau) \in A \mid X(\tau)) = P_{X(\tau)}(X(\cdot) \in A)$ a.s. $P_x$.

In fact,


Proposition 15.19. If $t^*$ assumes at most a countable number of values $\{\tau_k\}$,
then
$$P_x(X(\cdot + t^*) \in A \mid X(s), s \le t^*) = P_{X(t^*)}(X(\cdot) \in A) \quad \text{a.s. } P_x.$$

Proof. Take $C \in \mathcal{F}(X(s), s \le t^*)$. Let $\varphi(x)$ be any bounded measurable
function, $\tilde{\varphi}(x) = E_x \varphi(X(t))$. We prove the proposition first for $A$ one-dimensional.
Take $\varphi$ the set indicator of $A$, then


(15.20) $\int_C \varphi(X(t + t^*))\, dP_x = \sum_k \int_{C \cap \{t^* = \tau_k\}} \varphi(X(t + \tau_k))\, dP_x$.


By definition, $C \cap \{t^* = \tau_k\} \in \mathcal{F}(X(s), s \le \tau_k)$, so


$$\int_{C \cap \{t^* = \tau_k\}} \varphi(X(t + \tau_k))\, dP_x = \int_{C \cap \{t^* = \tau_k\}} E_x[\varphi(X(t + \tau_k)) \mid X(s), s \le \tau_k]\, dP_x = \int_{C \cap \{t^* = \tau_k\}} \tilde{\varphi}(X(\tau_k))\, dP_x,$$


which does it. The same thing goes for $\varphi$, a measurable function of many
variables, and then for the general case.

It is not unreasonable to hope that the strong Markov property would
hold in general. It doesn't! But we defer an example until the next chapter.

One class of examples of strong Markov processes with standard stationary
transition probabilities are the processes with stationary, independent
increments, where the sample paths are taken right-continuous. Let $X(t)$
be such a process. If at time $t$ the particle is at the point $x$, then the distribution
at time $t + \tau$ is gotten by adding to $x$ an increment independent of the
path up to $x$ and having the same distribution as $X(\tau)$. Thus, these processes
are Markov.


Problems 


1. For $X(t)$, a process with stationary, independent increments, show that
it is Markov with one set of transition probabilities given by


$$p_t(B \mid x) = P(X(t) \in B - x),$$
and that this set is regular and standard.


2. Show that processes with stationary, independent increments and right-continuous
sample paths are strong Markov.


3. Show that the functions $P_x(A)$, $E_x f(\cdot)$ of (15.15) are $\mathcal{B}_1(F)$-measurable.


4. Show that any Markov process with standard stationary transition 
probabilities is continuous from the right in probability. 


4. INFINITESIMAL CONDITIONS 


What do Markov processes look like? Actually, what do their sample
paths and transition probabilities look like? This problem is essentially one
of connecting up global behavior with local behavior. Note, for example,
that if the transition probabilities $p_t$ are known for all $t$ in any neighborhood
of the origin, then they are determined for all $t > 0$ by the Chapman-Kolmogorov
equations. Hence, one suspects that $p_t$ would be determined
for all $t$ by specifying the limiting behavior of $p_t$ as $t \to 0$. But, then, the
sample behavior will be very immediately connected with the behavior of
$p_t$ near $t = 0$.

To get a feeling for this, look at the processes with stationary, independent
increments. If it is specified that


$$P_x(|X(t) - x| > \epsilon) = o(t),$$


then the process is Brownian motion, all the transition probabilities are
determined, and all sample paths are continuous. Conversely, if all sample
paths are given continuous, the above limiting condition at $t = 0$ must hold.

At the other end, suppose one asks for a process $X(t)$ with stationary,
independent increments having all sample paths constant except for isolated
jumps. Then (see Section 6, Chapter 14) the probability of no jump in the
time interval $(0, t]$ is given by $e^{-\lambda t}$, so
$$P_0(X(t) = 0) = 1 - \lambda t + o(t).$$


If there is a jump, with magnitude governed by $F(x)$, then for $B \in \mathcal{B}_1$,
$\{0\} \not\subset B$,
$$P_0(X(t) \in B) = \lambda t F(B) + o(t).$$


Conversely, if there is a process $X(t)$ with stationary, independent increments,
and a $\lambda$, $F$ such that the above conditions hold as $t \to 0$, it is easy to check
that the process must be of the jump type with exponent function given by


$$\psi(u) = \lambda \int (e^{iux} - 1)\, F(dx).$$


In general now, let $\{X(t)\}$ be any Markov process with stationary transition
probabilities. Take $f(x)$ a bounded $\mathcal{B}_1$-measurable function on $F$.
Consider the class of these functions such that the limit as $t \downarrow 0$ of


(15.21) $\dfrac{E_x f(X(t)) - f(x)}{t}$


exists for all $x \in F$. Denote the resulting function by $(Sf)(x)$. $S$ is called
the infinitesimal operator and summarizes the behavior of the transition
probabilities as $t \to 0$. The class of bounded measurable functions such
that the limit in (15.21) exists for all $x \in F$, we will call the domain of $S$,
denoted by $\mathcal{D}(S)$. For example, for Poisson-like processes,


$$E_x f(X(t)) = E_0 f(X(t) + x) = \lambda t \int f(x + y)\, F(dy) + (1 - \lambda t) f(x) + o(t).$$

Define a measure $\mu(B; x)$ by

$$\mu(B; x) = \lambda F(B - x), \quad x \notin B, \quad \text{and} \quad \mu(\{x\}; x) = 0.$$
Then $S$ for this process can be written as

$$(Sf)(x) = \int (f(y) - f(x))\, \mu(dy; x).$$

In this example, no further restrictions on $f$ were needed to make the limit
as $t \to 0$ exist. Thus $\mathcal{D}(S)$ consists of all bounded measurable functions.


For Brownian motion, take $f(x)$ continuous and with a continuous,
bounded second derivative. Write


$$f(y) = f(x) + (y - x) f'(x) + \frac{(y - x)^2}{2} f''(x) + (y - x)^2\, \delta_x(y - x),$$
where
$$\delta_x(h) \le \sup_{|\xi| \le |h|} |f''(x + \xi) - f''(x)|.$$

From this,

$$E_x f(X(t)) = E_0 f(X(t) + x) = f(x) + \frac{E_0 X^2(t)}{2} f''(x) + R(t).$$

Use

$$E_0 X^2(t) = t, \qquad R(t) \le E_0[X^2(t)\, \delta_x(X(t))] = o(t),$$
to get


$$(Sf)(x) = \frac{1}{2} \frac{d^2 f(x)}{dx^2}.$$
In this case it is not clear what $\mathcal{D}(S)$ is, but it is certainly not the set of all
bounded measurable functions. These two examples will be typical in this
sense: The jumps in the sample paths contribute an integral operator
component to $S$; the continuous nonconstant parts of the paths contribute
a differential operator component.
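
The Brownian computation can be checked in the same way. The sketch below assumes normalized Brownian motion, so that $X(t)$ started at $x$ is distributed as $x + \sqrt{t}\, N$ with $N$ standard normal, and approximates the expectation by a trapezoid rule on $[-8, 8]$. The function names, the integration range, and the step count are all choices made for this illustration.

```python
import math

def expected_f(f, x, t, half_width=8.0, steps=4000):
    """E_x f(X(t)) = E f(x + sqrt(t) * N), N standard normal, by the trapezoid rule."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * f(x + math.sqrt(t) * z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def difference_quotient(f, x, t):
    """(E_x f(X(t)) - f(x)) / t, to be compared with f''(x) / 2."""
    return (expected_f(f, x, t) - f(x)) / t
```

For $f = \cos$ the expectation is $\cos(x)\, e^{-t/2}$ in closed form, so the quotient should be close to $-\tfrac{1}{2} \cos x = \tfrac{1}{2} f''(x)$ for small `t`.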

Once the behavior of $p_t$ near $t = 0$ is specified by specifying $S$, the
problem of computing the transition probabilities for all $t > 0$ is present.
$S$ hooks into the transition probabilities in two ways. In the first method,
we let the initial position be perturbed. That is, given $X(0) = x$, we let a
small time $\tau$ elapse and then condition on $X(\tau)$. This leads to the backwards
equations. In the second method, we perturb on the final position. We
compute the distribution up to time $t$ and then let a small time $\tau$ elapse.
Figures 15.1 and 15.2 illustrate computing $P_x(X(t + \tau) \in B)$.


Backwards Equations.

    $$P(X(t + \tau) \in B \mid X(0) = x) = E[P(X(t + \tau) \in B \mid X(\tau)) \mid X(0) = x].$$

Letting $\varphi_t(x) = p_t(B \mid x)$, we can write the above as

    $$\varphi_{t+\tau}(x) = E_x \varphi_t(X(\tau)) \quad\text{or}\quad
      \varphi_{t+\tau}(x) - \varphi_t(x) = E_x \varphi_t(X(\tau)) - \varphi_t(x).$$

[Fig. 15.1 Backwards equations. Fig. 15.2 Forwards equations.]

Dividing by $\tau$, letting $\tau \to 0$, if $p_t(B \mid x)$ is smooth enough
in $t$, $x$, we find that

(15.22)    $$\frac{\partial}{\partial t} p_t(B \mid x) = (S p_t(B \mid \cdot))(x),$$

that is, for Brownian motion, the backwards equations are

    $$\frac{\partial}{\partial t} p_t(B \mid x) = \frac{1}{2}\,\frac{\partial^2}{\partial x^2} p_t(B \mid x).$$

For the Poisson-like processes,

    $$\frac{\partial}{\partial t} p_t(B \mid x) = \int \bigl(p_t(B \mid y) - p_t(B \mid x)\bigr)\,\mu(dy; x).$$



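Forwards Equations. For any $f$ for which $Sf$ exists,

    $$E_x f(X(t + \tau)) = E_x\bigl[E\bigl(f(X(t + \tau)) \mid X(t)\bigr)\bigr]$$

or

    $$\int f(y)\,p_{t+\tau}(dy \mid x) = \int \Bigl[\int f(z)\,p_\tau(dz \mid y)\Bigr] p_t(dy \mid x).$$

Subtract $E_x f(X(t))$ from both sides, divide by $\tau$, and let
$\tau \downarrow 0$. With enough smoothness,

    $$\int f(y)\,\frac{\partial}{\partial t} p_t(dy \mid x) = \int (Sf)(y)\,p_t(dy \mid x).$$

Thus, if $S$ has an adjoint $S^*$, the equations are

(15.23)    $$\frac{\partial}{\partial t} p_t(B \mid x) = (S^* p_t(\cdot \mid x))(B).$$

For Poisson-like processes,

    $$\int (Sf)(y)\,p_t(dy \mid x) = \int \Bigl[\int (f(z) - f(y))\,\mu(dz; y)\Bigr] p_t(dy \mid x),$$

so the forwards equations are

    $$\frac{\partial}{\partial t} p_t(B \mid x) = \int \bar{\mu}(B; y)\,p_t(dy \mid x),$$

where $\bar{\mu}(B; y) = \mu(B; y)$, $y \notin B$, and
$\bar{\mu}(\{y\}; y) = -\mu(F; y)$. If $p_t(dy \mid x) \ll \mu(dy)$ for all
$t$, $x$, then take $S^*$ to be the adjoint with respect to $\mu(dy)$. For
example, for Brownian motion $p_t(dy \mid x) \ll dy$. For all $f(y)$ with
continuous second derivatives vanishing off finite intervals,

    $$\int \frac{1}{2} f''(y)\,p_t(y \mid x)\,dy = \int f(y)\,\frac{1}{2}\,\frac{\partial^2}{\partial y^2} p_t(y \mid x)\,dy,$$

where $p_t(y \mid x)$ denotes (badly) the density of $p_t(dy \mid x)$ with
respect to $dy$. Hence the forwards equation is

    $$\frac{\partial}{\partial t} p_t(y \mid x) = \frac{1}{2}\,\frac{\partial^2}{\partial y^2} p_t(y \mid x).$$
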
The forwards or backwards equations, together with the boundary
condition $p_t(\cdot \mid x) \to \delta_x(\cdot)$ as $t \to 0$ can provide an
effective method of computing the transition probabilities, given the
infinitesimal conditions. But the questions regarding the existence and
uniqueness of solutions are difficult to cope with. It is possible to look at
these equations analytically, forgetting their probabilistic origin, and
investigate their solutions. But the most illuminating approach is a direct
construction of the required processes.
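
For a finite state space both sets of equations can be checked directly. The sketch below assumes a hypothetical three-state rate matrix `Q` (row $j$, column $k$ holds the jump rate from $j$ to $k$, with the diagonal chosen so rows sum to zero) and uses the standard fact, not proved in this section, that the transition matrix is then $P(t) = e^{tQ}$. It compares a centered difference in $t$ with $QP(t)$ (backwards) and $P(t)Q$ (forwards). All names and numbers are illustrative.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(q, t, squarings=12, terms=12):
    """P(t) = exp(tQ): truncated Taylor series of exp(tQ / 2^squarings),
    followed by repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    small = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def max_abs_diff(a, b):
    n = len(a)
    return max(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n))

# Hypothetical jump rates; each row sums to zero.
Q = [[-2.0, 1.5, 0.5],
     [1.0, -1.0, 0.0],
     [0.5, 2.5, -3.0]]

t, h = 0.7, 1e-4
P = transition_matrix(Q, t)
P_plus = transition_matrix(Q, t + h)
P_minus = transition_matrix(Q, t - h)
dP = [[(P_plus[i][j] - P_minus[i][j]) / (2 * h) for j in range(3)] for i in range(3)]

backward_gap = max_abs_diff(dP, mat_mul(Q, P))   # S acts on the starting state
forward_gap = max_abs_diff(dP, mat_mul(P, Q))    # the adjoint acts on the final state
print(backward_gap, forward_gap)
```

Both gaps come out at the level of the finite-difference error. In this finite setting $Q$ commutes with $e^{tQ}$, which is why the two equations have the same solution.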


5. PURE JUMP PROCESSES 


Definition 15.24. A Markov process $\{X(t)\}$ will be called a pure jump process
if, starting from any point $x \in F$, the process has all sample paths constant
except for isolated jumps, and right-continuous.


Proposition 15.25. If $\{X(t)\}$ is a pure jump process, then it is strong Markov.

Proof. By 15.19, for $\varphi$ bounded and measurable $\mathcal{B}_k$, and

    $$E_x \varphi(X(t_1), \ldots, X(t_k)) = \psi(x),$$

we have for any $C \in \mathcal{F}(X(t), t \le t_n^*)$, where $t_n^*$ takes on
only a countable number of values and $t_n^* \downarrow t^*$,

    $$\int_C \varphi(X(t_1 + t_n^*), \ldots, X(t_k + t_n^*))\,dP_x = \int_C \psi(X(t_n^*))\,dP_x.$$

Since $t^* \le t_n^*$, this holds for all $C \in \mathcal{F}(X(t), t \le t^*)$.
Since all paths are constant except for jumps and right-continuous,

    $$\psi(X(t_n^*)) \to \psi(X(t^*)),$$
    $$\varphi(X(t_1 + t_n^*), \ldots) \to \varphi(X(t_1 + t^*), \ldots)$$

for every sample path. Taking limits in the above integral equality proves
the proposition.


Assume until further notice that $\{X(t)\}$ is a pure jump process.
Definition 15.26. Define the time $T$ of the first jump as
    $$T = \inf\{t;\ X(t) \ne X(0)\}.$$
Proposition 15.27. $T$ is a Markov time.

Proof. Let $T_n = \inf\{k/2^n;\ X(k/2^n) \ne X(0)\}$. For every $\omega$,
$T_n \downarrow T$. Further,

    $$\{T_n \le t\} = \bigcup_{k/2^n \le t} \{X(k/2^n) \ne X(0)\} \in \mathcal{F}(X(s), s \le t).$$

The sets $\{T_n \le t\}$ are monotonic increasing, and their limit is therefore
in $\mathcal{F}(X(s), s \le t)$. If $t$ is a binary rational, the limit is
$\{T \le t\}$. If not, the limit is $\{T < t\}$. But,

    $$\{T \le t\} = \{T < t\} \cup \{X(t) \ne X(0)\}.$$


The basic structure of Markov jump processes is given by


Theorem 15.28. Under $P_x$, $T$ and $X(T)$ are independent and there is a
$\mathcal{B}_1(F)$-measurable nonnegative function $\lambda(x)$ on $F$ such that

    $$P_x(T > t) = e^{-\lambda(x)t}.$$

Proof. Let $t + T_1$ be the first exit time from state $x$ past time $t$; that is,

    $$t + T_1 = \inf\{\tau;\ X(\tau) \ne x,\ \tau \ge t\}.$$

Then, for $x \notin B$,

    $$P_x(X(T) \in B, T > t) = P_x(X(T) \in B, T > t, X(t) = x)$$
    $$= P_x(X(t + T_1) \in B, T > t, X(t) = x)$$
    $$= P_x(X(t + T_1) \in B \mid T > t, X(t) = x) \cdot P_x(T > t, X(t) = x)$$

if $P_x(T > t) > 0$. Assume this for the moment. Then,
$\{T > t\} = \{T \le t\}^c \in \mathcal{F}(X(s), s \le t)$, and

    $$P_x(X(t + T_1) \in B \mid T > t, X(t) = x) = P_x(X(t + T_1) \in B \mid X(t) = x) = P_x(X(T) \in B).$$

Going back, we have

    $$P_x(X(T) \in B, T > t) = P_x(X(T) \in B)\,P_x(T > t).$$

There must be some $t_0 > 0$, such that $P_x(T > t_0) > 0$. Take
$0 < t + \tau < t_0$; then

    $$P_x(T > t + \tau) = P_x(T > t + \tau, T > t, X(t) = x)$$
    $$= P_x(T_1 > \tau \mid X(t) = x, T > t)\,P_x(T > t)$$
    $$= P_x(T > \tau)\,P_x(T > t).$$

Therefore $\varphi(t) = P_x(T > t)$ is a monotonic nonincreasing function
satisfying $\varphi(t + \tau) = \varphi(t)\varphi(\tau)$ for $t + \tau < t_0$.
This implies that there is a parameter $\lambda(x)$ such that

    $$P_x(T > t) = e^{-\lambda(x)t},$$

with $0 \le \lambda(x) < \infty$. If $\lambda(x) = 0$, then the state $x$ is
absorbing, that is, $P_x(X(t) = x) = 1$, for all $t$. The measurability of
$\lambda(x)$ follows from the measurability of $P_x(T > t)$.


Corollary 15.29. For a pure jump process, let

    $$P_x(T > t) = e^{-\lambda(x)t}, \qquad P_x(X(T) \in B) = p(B; x).$$

Then at every point $x$ in $F$,

    $$p_t(\{x\} \mid x) = 1 - \lambda(x)t + o(t),$$
    $$p_t(B \mid x) = (\lambda(x)t + o(t))\,p(B; x), \quad x \notin B.$$


Proof. Suppose that we prove that the probability of two jumps in time $t$
is $o(t)$; then both statements are clear because

    $$P_x(X(t) = x) = P_x(X(t) = x, \text{no more than one jump}) + o(t)$$
    $$= P_x(T > t) + o(t)$$
    $$= 1 - \lambda(x)t + o(t).$$

Similarly,

    $$P_x(X(t) \in B) = P_x(X(t) \in B, \text{no more than one jump}) + o(t)$$
    $$= P_x(T \le t, X(T) \in B) + o(t)$$
    $$= (\lambda(x)t + o(t))\,p(B; x).$$


Remark. The reason for writing $o(t)$ in front of $p(B; x)$ is to emphasize that
$o(t)/t \to 0$ uniformly in $B$.


We finish by


Proposition 15.30. Let $T_0$ be the time of the first jump, $T_0 + T_1$ the time
of the second jump. Then

    $$P_x(T_0 + T_1 \le t) = o(t).$$


Proof. $P_x(T_0 + T_1 \le t) \le P_x(T_0 \le t, T_1 \le t)$. Now,

    $$P_x(T_1 \le t \mid T_0) = E_x[P_x(T_1 \le t \mid T_0, X(T_0)) \mid T_0],$$

so

    $$P_x(T_1 \le t \mid T_0) = \int (1 - e^{-\lambda(y)t})\,p(dy; x).$$

This goes to zero as $t \to 0$, by the bounded convergence theorem. The
following inequality

    $$P_x(T_0 + T_1 \le t) \le (1 - e^{-\lambda(x)t}) \int (1 - e^{-\lambda(y)t})\,p(dy; x),$$

now proves the stated result.


Define a measure $\mu(dy; x)$ by

    $$\mu(B; x) = \lambda(x)\,p(B; x), \qquad \mu(\{x\}; x) = 0.$$

The infinitesimal operator $S$ is given, following 15.30, by

(15.31)    $$(Sf)(x) = \int (f(y) - f(x))\,\mu(dy; x),$$

and $D(S)$ consists of all bounded $\mathcal{B}_1(F)$-measurable functions. The
following important result holds for jump processes.

Theorem 15.32. $p_t(B \mid x)$ satisfies the backwards equations.

Proof. First, we derive the equation

(15.33)    $$p_t(B \mid x) = \chi_B(x)\,e^{-\lambda(x)t}
               + \int_0^t \int p_{t-\tau}(B \mid y)\,\mu(dy; x)\,e^{-\lambda(x)\tau}\,d\tau.$$

The intuitive idea behind (15.33) is simply to condition on the time and
position of the first jump. To see this, write

    $$P_x(X(t) \in B) = P_x(X(t) \in B, T > t) + P_x(X(t) \in B, T \le t).$$

The first term is $\chi_B(x)\,e^{-\lambda(x)t}$. Reason that to evaluate the second
term, if $T \in d\tau$, and $X(T) \in dy$, then the particle has to get from $y$
to $B$ in time $t - \tau$. Hence the second term should be

    $$\int\!\!\int p_{t-\tau}(B \mid y)\,P_x(X(T) \in dy)\,P_x(T \in d\tau),$$

and this is exactly the second term in (15.33). A rigorous derivation could
be given along the lines of the proof of the strong Markov property. But it
is easier to use a method involving Laplace transforms which has wide
applicability when random times are involved. We sketch this method:
First note that since $X(t)$ is jointly measurable in $t$, $\omega$,
$p_t(B \mid x) = E_x \chi_B(X(t))$ is measurable in $t$. Define

    $$\hat{p}_s(x) = \int_0^\infty e^{-st}\,p_t(B \mid x)\,dt
        = E_x\Bigl[\int_0^\infty e^{-st}\,\chi_B(X(t))\,dt\Bigr], \quad s > 0.$$

Write

    $$\int_0^\infty e^{-st}\,\chi_B(X(t))\,dt = \int_0^T e^{-st}\,\chi_B(X(t))\,dt + \int_T^\infty e^{-st}\,\chi_B(X(t))\,dt.$$

The first term is

    $$\chi_B(x)\,\frac{1 - e^{-sT}}{s}.$$

The second is

    $$e^{-sT} \int_0^\infty e^{-st}\,\chi_B(X(t + T))\,dt.$$

By the strong Markov property,

    $$E_x\Bigl(\int_0^\infty e^{-st}\,\chi_B(X(t + T))\,dt \Bigm| X(T), T\Bigr) = \hat{p}_s(X(T)).$$

Hence

    $$\hat{p}_s(x) = \chi_B(x)\,\frac{1}{s + \lambda(x)} + \frac{\lambda(x)}{s + \lambda(x)} \int \hat{p}_s(y)\,p(dy; x).$$

This is exactly the transform of (15.33), and invoking the uniqueness theorem
for Laplace transforms (see [140]) gets (15.33) almost everywhere ($dt$). For
$\{X(t)\}$ a pure jump process, writing $p_t(B \mid x)$ as $E_x \chi_B(X(t))$
makes it clear that $p_t(B \mid x)$ is right-continuous. The right side of
(15.33) is obviously continuous in time; hence (15.33) holds identically.

Multiply (15.33) by $e^{\lambda(x)t}$ and substitute $t - \tau = \tau'$ in the
second term:

    $$e^{\lambda(x)t}\,p_t(B \mid x) = \chi_B(x) + \int_0^t \int e^{\lambda(x)\tau'}\,p_{\tau'}(B \mid y)\,\mu(dy; x)\,d\tau'.$$

Hence $p_t(B \mid x)$ is differentiable, and

    $$\frac{\partial}{\partial t}\bigl(e^{\lambda(x)t}\,p_t(B \mid x)\bigr) = e^{\lambda(x)t} \int p_t(B \mid y)\,\mu(dy; x).$$

An easy simplification gives the backwards equations.
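
The integral equation (15.33) can be checked numerically on a two-state chain, where the transition probabilities have a well-known closed form. The sketch below assumes states 0 and 1 with hypothetical rates $\lambda(0) = a$, $\lambda(1) = b$, so that $\mu(dy; 0)$ is $a$ times a unit mass at 1, and takes $B = \{0\}$, $x = 0$. Names and numbers are illustrative.

```python
import math

a, b = 2.0, 3.0     # hypothetical rates: lambda(0) = a, lambda(1) = b

def p_to_zero(t, start):
    """Closed-form p_t({0} | start) for the two-state chain."""
    stationary = b / (a + b)
    decay = math.exp(-(a + b) * t)
    if start == 0:
        return stationary + (a / (a + b)) * decay
    return stationary * (1.0 - decay)

def rhs_15_33(t, n=20000):
    """Right side of (15.33) for B = {0}, x = 0: the no-jump term plus the
    integral over the time tau of the first jump (midpoint rule)."""
    h = t / n
    integral = 0.0
    for i in range(n):
        tau = (i + 0.5) * h
        integral += p_to_zero(t - tau, 1) * a * math.exp(-a * tau)
    return math.exp(-a * t) + integral * h

gaps = [abs(rhs_15_33(t) - p_to_zero(t, 0)) for t in (0.1, 0.5, 2.0)]
print(gaps)
```

The gaps are at the level of the quadrature error, consistent with (15.33) holding identically in $t$.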

The forwards equations are also satisfied. (See Chung [16, pp. 224 ff.],
for instance.) In fact, most of the questions regarding the transition
probabilities can be answered by using the representation

(15.34)    $$\chi_B(X(t)) = \sum_{n=0}^{\infty} \chi_{[R_n, R_{n+1})}(t)\,\chi_B(X(R_n)),$$

where $R_n$ is the time of the $n$th jump, $R_0 = 0$, so

    $$R_n = T_0 + \cdots + T_{n-1}, \quad n > 0,$$

where the $T_k$ are the first exit times after the $k$th jump. We make use of this
in the next section to prove a uniqueness result for the backward equation.


Problem 5. Show that a pure jump process $\{X(t)\}$, $t \ge 0$ is jointly
measurable in $(t, \omega)$ with respect to
$\mathcal{B}_1([0, \infty)) \times \mathcal{F}$.


6. CONSTRUCTION OF JUMP PROCESSES 


In modeling Markov processes what is done, usually, is to prescribe
infinitesimal conditions. For example: Let $F$ be the integers, then a
population model with constant birth and death rates would be constructed by
specifying that in a small time $\Delta t$, if the present population size is
$j$, the probability of increasing by one is $r_B j\,\Delta t$, where $r_B$ is the
birth rate. The probability of a decrease is $r_D j\,\Delta t$ where $r_D$ is the
death rate, and the probability of no change is
$1 - r_D j\,\Delta t - r_B j\,\Delta t$. What this translates into is

    $$p_t(j + 1 \mid j) = r_B j t + o(t),$$
    $$p_t(j - 1 \mid j) = r_D j t + o(t),$$
    $$p_t(j \mid j) = 1 - (r_B + r_D) j t + o(t),$$
    $$p_t(k \mid j) = o(t), \quad \text{otherwise}.$$

In general, countable state processes are modeled by specifying
$q(k \mid j) \ge 0$, $q(j) = \sum_{k \ne j} q(k \mid j) < \infty$, such that

    $$p_t(k \mid j) = q(k \mid j)\,t + o(t), \quad k \ne j,$$
    $$p_t(j \mid j) = 1 - q(j)\,t + o(t).$$

General jump processes are modeled by specifying finite measures
$\mu(B; x)$, measurable in $x$ for every $B \in \mathcal{B}_1(F)$, such that for
every $x$,

(15.35)    $$p_t(B \mid x) = \mu(B; x)\,t + o(t), \quad x \notin B,$$
           $$p_t(\{x\} \mid x) = 1 - \mu(F; x)\,t + o(t).$$

Now the problem is: Is there a unique Markov jump process fitting into
(15.35)? Working backward from the results of the last section, we know
that if there is a jump process satisfying (15.35), then it is appropriate to
define

    $$\lambda(x) = \mu(F; x), \qquad p(B; x) = \mu(B; x)/\mu(F; x), \qquad p(\{x\}; x) = 0,$$

and look for a process such that

    $$P_x(T > t) = e^{-\lambda(x)t}, \qquad P_x(X(T) \in B) = p(B; x).$$

Theorem 15.28 gives us the key to the construction of a pure jump process.
Starting at $x$, we wait there a length of time $T_0$ exponentially distributed
with parameter $\lambda(x)$; then independently of how long we wait, our first
jump is to a position with distribution $p(dy; x)$. Now we wait at our new
position $y$, time $T_1$ independent of $T_0$, with distribution parameter
$\lambda(y)$, etc. Note that these processes are very similar to the
Poisson-like processes with independent increments. Heuristically, they are a
sort of patched-together assembly of such processes, in the sense that at
every point $x$ the process behaves at that point like a Poisson-like process
with parameter $\lambda(x)$ and jump distribution given by $p(B; x)$.
At any rate, it is pretty clear how to proceed with the construction.

1) The space structure of this process is obtained by constructing a discrete
Markov process $X_0, X_1, X_2, \ldots$, moving under the transition
probabilities $p(B; x)$, and starting from any point $x$.

2) The time flow of the process consists of slowing down or speeding up the
rate at which the particle travels along the paths of the space structure.

For every $n$, $x \in F$, construct random variables $T_n(x)$ such that

i) $P(T_n(x) > t) = e^{-\lambda(x)t}$,

ii) $T_n(x)$ are jointly measurable in $\omega$, $x$,

iii) the processes $(X_0, X_1, \ldots)$, $(T_0(x), x \in F)$,
$(T_1(x), x \in F), \ldots$ are mutually independent.

The $T_n(x)$ will serve as the waiting time in the state $x$ after the $n$th jump.
To see that joint measurability can be gotten, define on the probability
space $((0, 1], \mathcal{B}((0, 1]), dz)$ the variables
$T(z, \lambda) = -(1/\lambda) \log z$. Thus $P(T(z, \lambda) > t) = e^{-\lambda t}$.
Now define $T_0(x) = T(z, \lambda(x))$ and take the cross-product space with the
sample space for $X_0, X_1, \ldots$ Similarly, for $T_1(x), T_2(x), \ldots$
For the process itself, proceed with

Definition 15.36. Define variables as follows:

    $$R_0 = 0, \qquad R_n = T_0(X_0) + T_1(X_1) + \cdots + T_{n-1}(X_{n-1}),$$
    $$n^*(t) = n, \quad \text{if } R_n \le t < R_{n+1},$$
    $$X(t) = X_{n^*(t)}.$$

In this definition $R_n$ functions as the time of the $n$th jump.

Theorem 15.37. If $n^*(t)$ is well-defined by 15.36 for all $t$, then $X(t)$ is a
pure jump Markov process with transition probabilities satisfying the given
infinitesimal conditions.

Proof. This is a straightforward verification. The basic point is that given
$X(t) = x$, and given, say, that we got to this space-time point in $n$ steps,
then the waiting time in $x$ past $t$ does not depend on how long has already
been spent there; that is,

    $$P(T_n(x) > \tau + \xi \mid T_n(x) > \xi) = e^{-\lambda(x)\tau}.$$

To show that the infinitesimal conditions are met, just show again that the
probability of two jumps in time $t$ is $o(t)$.
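
The construction in 15.36 translates almost line by line into a simulation. The sketch below is only an illustration under stated assumptions: the helper names (`simulate_jump_path`, `state_at`), the rates $r_B = 1.0$, $r_D = 1.2$, the starting population 10, and the `max_jumps` guard are all hypothetical choices. Waiting times are generated as $-(1/\lambda) \log z$ with $z$ uniform on $(0, 1]$, as in the text, and the space structure is the birth and death example from the start of this section.

```python
import bisect
import math
import random

def simulate_jump_path(x0, rate, jump, t_max, rng, max_jumps=100_000):
    """Return (jump_times, states) with jump_times[n] = R_n and states[n] = X_n,
    recording only the jumps that occur by time t_max."""
    jump_times, states = [0.0], [x0]
    while len(states) <= max_jumps:        # guard against runaway (explosive) paths
        x = states[-1]
        lam = rate(x)
        if lam <= 0.0:                     # lambda(x) = 0: absorbing state
            break
        z = 1.0 - rng.random()             # uniform on (0, 1]
        wait = -math.log(z) / lam          # T(z, lambda) = -(1/lambda) log z
        if jump_times[-1] + wait > t_max:
            break
        jump_times.append(jump_times[-1] + wait)
        states.append(jump(x, rng))
    return jump_times, states

def state_at(jump_times, states, t):
    """X(t) = X_n for R_n <= t < R_{n+1}; valid for t <= t_max when the
    max_jumps guard was not hit."""
    return states[bisect.bisect_right(jump_times, t) - 1]

R_B, R_D = 1.0, 1.2                        # hypothetical birth and death rates

def rate(j):
    return (R_B + R_D) * j                 # lambda(j) = mu(F; j)

def jump(j, rng):
    return j + 1 if rng.random() < R_B / (R_B + R_D) else j - 1

rng = random.Random(0)
jump_times, states = simulate_jump_path(10, rate, jump, t_max=2.0, rng=rng)
print(len(states) - 1, "jumps; X(2.0) =", state_at(jump_times, states, 2.0))
```

Drawing a fresh uniform variable at each step mirrors the independence requirement iii) on the waiting times.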
The condition that $n^*(t)$ be well-defined is that

    $$R_\infty = \sum_{n=0}^{\infty} T_n(X_n) = \infty \quad \text{a.s. } P_x, \text{ all } x.$$

This is a statement that at most a finite number of jumps can occur in every
finite time interval. If $P_x(R_\infty < \infty) > 0$, there is no pure jump
process that satisfies the infinitesimal conditions for all $x \in F$. However,
even if $R_\infty = \infty$ a.s., the question of uniqueness has been left open.
Is there another Markov process, not necessarily a jump process, satisfying
the infinitesimal conditions? The answer, in regard to distribution, is No.
The general result states that if $P_x(R_\infty < \infty) = 0$, all $x \in F$,
then any Markov process $\{X(t)\}$ satisfying the infinitesimal conditions
(15.35) is a pure jump process and has the same distribution as the
constructed process. For details of this, refer to Doob [39, pp. 266 ff.]. We
content ourselves with the much easier assertion:


Proposition 15.38. Any two pure jump processes having the same infinitesimal
operator $S$ have the same distribution.


Proof. This is now almost obvious, because for both processes, $T$ and
$X(T)$ have the same distribution. Therefore the sequence of variables
$X(T_0)$, $X(T_0 + T_1), \ldots$ has distribution governed by $p(B; x)$, and
given this sequence, the jump times are sums of independent variables with
the same distribution as the constructed variables $\{R_n\}$.


Let $\{X(t)\}$ be the constructed process. Whether or not $R_\infty = \infty$
a.s., define $p_t^{(N)}(B \mid x)$ as the probability that $X(t)$ reaches $B$ in
time $t$ in $N$ or fewer jumps. That is,

    $$p_t^{(N)}(B \mid x) = P_x(X(t) \in B, R_{N+1} > t)
        = \sum_{n=0}^{N} P_x(X_n \in B, R_n \le t < R_{n+1}).$$

For $n \ge 1$, and $\tau \le t$,

    $$P_x(X_n \in B,\ T_0 + \cdots + T_{n-1} \le t < T_0 + \cdots + T_n \mid X_1 = y, T_0 = \tau)$$
    $$= P_x(X_n \in B,\ T_1 + \cdots + T_{n-1} \le t - \tau < T_1 + \cdots + T_n \mid X_1 = y)$$
    $$= P_y(X_{n-1} \in B,\ T_0 + \cdots + T_{n-2} \le t - \tau < T_0 + \cdots + T_{n-1}) \quad \text{a.s. } p(dy; x).$$

The terms for $n \ge 1$ vanish for $\tau > t$, and the zero term is

    $$P_x(X_0 \in B, R_1 > t) = \chi_B(x)\,e^{-\lambda(x)t};$$

hence integrating out $X_1$, $T_0$ gives


Proposition 15.39

    $$p_t^{(N+1)}(B \mid x) = \chi_B(x)\,e^{-\lambda(x)t}
        + \int_0^t \int p_{t-\tau}^{(N)}(B \mid y)\,\mu(dy; x)\,e^{-\lambda(x)\tau}\,d\tau.$$


Letting $N \to \infty$ gives another proof that for a pure jump process, integral
equation (15.33) and hence the backwards equations are satisfied.
Define

(15.40)    $$\bar{p}_t(B \mid x) = \lim_N p_t^{(N)}(B \mid x).$$

The significance of $\bar{p}_t(B \mid x) = \lim_N p_t^{(N)}(B \mid x)$ is that it
is the probability of going from $x$ to $B$ in time $t$ in a finite number of
steps.

Proposition 15.41. $\bar{p}_t(B \mid x)$ is the minimal solution of the backwards
equations in the sense that if $q_t(x)$ is any other solution satisfying

1) $q_t(x) \ge 0$,
2) $q_t(x) \to \chi_B(x)$ as $t \to 0$,

then

    $$\bar{p}_t(B \mid x) \le q_t(x).$$


Proof. The backwards equation is

    $$\frac{\partial}{\partial t} q_t(x) = -\lambda(x)\,q_t(x) + \int_{\{x\}^c} q_t(y)\,\mu(dy; x).$$

Multiply by $e^{\lambda(x)t}$, integrate from 0 to $t$, and we recover the integral
equation

    $$q_t(x) = \chi_B(x)\,e^{-\lambda(x)t} + \int_0^t \int_{\{x\}^c} q_{t-\tau}(y)\,\mu(dy; x)\,e^{-\lambda(x)\tau}\,d\tau.$$

Assume $q_t(x) \ge p_t^{(N)}(B \mid x)$. Then substituting this inequality in the
integral on the right,

    $$q_t(x) \ge \chi_B(x)\,e^{-\lambda(x)t} + \int_0^t \int_{\{x\}^c} p_{t-\tau}^{(N)}(B \mid y)\,\mu(dy; x)\,e^{-\lambda(x)\tau}\,d\tau
        = p_t^{(N+1)}(B \mid x).$$

By the nonnegativity of $q_t$,

    $$q_t(x) \ge \chi_B(x)\,e^{-\lambda(x)t} = p_t^{(0)}(B \mid x).$$

Hence $q_t(x) \ge \lim_N p_t^{(N)}(B \mid x) = \bar{p}_t(B \mid x)$.


Corollary 15.42. If $\{X(t)\}$ is a pure jump process, equivalently, if
$R_\infty = \infty$ a.s. $P_x$, all $x \in F$, then $\{p_t(B \mid x)\}$ are the
unique set of transition probabilities satisfying the backwards equations.


7. EXPLOSIONS 


If there are only a finite number of jumps in every finite time interval,
then everything we want goes through: the forwards and backwards equations
are satisfied and the solutions are unique. Therefore it becomes important to
be able to recognize from the infinitesimal conditions when the resulting
process will be pure jump. The thing that may foul the process up is
unbounded $\lambda(x)$. The expected duration of stay in state $x$ is given by
$E_x T = 1/\lambda(x)$. Hence if $\lambda(x) \to \infty$ anywhere, there is the
possibility that the particle will move from state to state, staying in each
one a shorter period of time. In the case where $F$ represents the integers,
$\lambda(n)$ can go to $\infty$ only if $n \to \infty$. In this case, we can have
infinitely many jumps only if the particle can move out to $\infty$ in finite
time. This is dramatically referred to as the possibility of explosions in
the process. Perhaps the origin of this is in a population explosion model
with pure birth,

    $$p_t(j + 1 \mid j) = \lambda(j)\,t + o(t),$$
    $$p_t(j \mid j) = 1 - \lambda(j)\,t + o(t).$$

Here the space structure is $p(n + 1; n) = 1$; the particle must move one
step to the right each unit. Hence, $X_n = n$ if $X_0 = 0$. Now

    $$R_n = T_0(0) + \cdots + T_{n-1}(n - 1)$$

is the time necessary to move $n$ steps. And

    $$E R_\infty = \sum_{n=0}^{\infty} \frac{1}{\lambda(n)}.$$

If this sum is finite, then $R_\infty < \infty$ a.s. $P_0$. This is also
sufficient for $P_j(R_\infty < \infty) = 1$, for all $j \in F$. Under these
circumstances the particle explodes out to infinity in finite time, and the
theorems of the previous sections do not apply.


One criterion that is easy to derive is


Proposition 15.43. A process satisfying the given infinitesimal conditions will
be pure jump if and only if

    $$P_x\Bigl(\sum_{n=0}^{\infty} \frac{1}{\lambda(X_n)} < \infty\Bigr) = 0, \quad \text{all } x \in F.$$

Proof. For $\sum_0^\infty T_n$ a sum of independent, exponentially distributed
random variables with parameters $\lambda_n$, $\sum_0^\infty T_n < \infty$ a.s. iff
$\sum_0^\infty 1/\lambda_n < \infty$. Because for $s > 0$,

    $$E e^{-s(T_0 + \cdots + T_n)} = \frac{1}{\prod_{k=0}^{n} (1 + s/\lambda_k)}.$$

Verify that the infinite product on the right converges to a finite limit
iff $\sum_0^\infty 1/\lambda_n < \infty$, and apply 8.36 and the continuity theorem
given in Section 13, Chapter 8. Now note that given $(X_0, X_1, \ldots)$,
$R_\infty$ is a sum of such variables with parameters $\lambda(X_n)$.
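
The dichotomy in this proof is easy to see numerically for a pure birth process started at 0. The sketch below assumes two hypothetical rate functions, $\lambda(n) = (n + 1)^2$ (summable reciprocals) and $\lambda(n) = n + 1$ (divergent reciprocals), and evaluates both the partial sums of $E R_\infty$ and the Laplace transform $E e^{-s(T_0 + \cdots + T_{N-1})} = 1/\prod (1 + s/\lambda(n))$ at $s = 1$. For the quadratic case the limits are the classical values $\pi^2/6$ and $\pi/\sinh \pi$.

```python
import math

def expected_total_time(rate, n_terms):
    """Partial sum of E R_infinity = sum over n of 1 / lambda(n)."""
    return sum(1.0 / rate(n) for n in range(n_terms))

def laplace_transform(rate, s, n_terms):
    """E exp(-s (T_0 + ... + T_{n_terms - 1})) = 1 / prod (1 + s / lambda(n))."""
    log_product = sum(math.log1p(s / rate(n)) for n in range(n_terms))
    return math.exp(-log_product)

def quadratic(n):          # hypothetical explosive rates
    return (n + 1.0) ** 2

def linear(n):             # hypothetical non-explosive rates
    return n + 1.0

N = 100_000
mean_quadratic = expected_total_time(quadratic, N)    # stays bounded
mean_linear = expected_total_time(linear, N)          # grows like log N
lt_quadratic = laplace_transform(quadratic, 1.0, N)   # tends to a positive limit
lt_linear = laplace_transform(linear, 1.0, N)         # tends to 0
print(mean_quadratic, mean_linear, lt_quadratic, lt_linear)
```

A positive limit of the transform corresponds to $R_\infty < \infty$ with positive probability (explosion), while a limit of 0 corresponds to $R_\infty = \infty$ a.s.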


Corollary 15.44. If $\sup_{x \in F} \lambda(x) < \infty$, then $\{X(t)\}$ is pure jump.


Note that for a pure birth process $\sum_0^\infty 1/\lambda(n) = \infty$ is both
necessary and sufficient for the process to be pure jump. For $F$ the
integers, another obvious sufficient condition is that every state be
recurrent under the space structure. Conversely, consider

    $$E_j\Bigl(\sum_{n=0}^{\infty} \frac{1}{\lambda(X_n)}\Bigr).$$

Let $N(k)$ be the number of entries that the sequence $X_0, X_1, \ldots$ makes
into state $k$. Then

    $$E_j\Bigl(\sum_{n=0}^{\infty} \frac{1}{\lambda(X_n)}\Bigr) = \sum_k \frac{E_j N(k)}{\lambda(k)}.$$

If this is finite, then certainly there will be an explosion. Time-continuous
birth and death processes are defined as processes on the integers such that
each jump can be only to adjacent states, so that

    $$p_t(j + 1 \mid j) = b(j)\,t + o(t),$$
    $$p_t(j - 1 \mid j) = d(j)\,t + o(t),$$
    $$p_t(j \mid j) = 1 - (b(j) + d(j))\,t + o(t).$$

For a birth and death process with no absorbing states moving on the
nonnegative integers, $E_j(N(k)) = 1 + M(k)$, where $M(k)$ is the expected
number of returns to $k$ given $X_0 = k$. Then, as above,

    $$E_j\Bigl(\sum_{n=0}^{\infty} \frac{1}{\lambda(X_n)}\Bigr) = \sum_k \frac{1 + M(k)}{\lambda(k)}.$$

The condition that this latter sum be infinite is both necessary and sufficient
for no explosions [12].

Another method for treating birth and death processes was informally
suggested to us by Charles Stone. Let $F$ be the nonnegative integers with
no absorbing states. Let $t_n^*$ be the first passage time from state 1 to
state $n$, and $\tau_n^*$ the first passage time from state $n$ to state $n + 1$.
The $\tau_1^*, \tau_2^*, \ldots$ are independent, $t_{n+1}^* = \sum_1^n \tau_k^*$.


Proposition 15.45. $t_\infty^* = \lim_n t_n^*$ is finite a.s. if and only if
$\sum_1^\infty E\tau_n^* < \infty$.


Proof. Let $T_k$ be the duration of first stay in state $k$, then
$\tau_k^* \ge T_k$. Further $\sum_1^\infty T_k < \infty$ a.s. iff
$\sum_1^\infty E T_k < \infty$ or $\sum_1^\infty 1/\lambda(k) < \infty$. Hence if
$\inf_k \lambda(k) = 0$, both $\sum_1^\infty \tau_k^*$ and
$\sum_1^\infty E\tau_k^*$ are infinite. Now assume
$\inf_k \lambda(k) = \delta > 0$. Given any succession of states
$X_0 = n, X_1, \ldots, X_m = n + 1$ leading from $n$ to $n + 1$, $\tau_n^*$ is a
sum of independent, exponentially distributed random variables
$T_0 + \cdots + T_{m-1}$ and

    $$\sigma^2(T_0 + \cdots + T_{m-1}) = \sigma^2(T_0) + \cdots + \sigma^2(T_{m-1})
        \le \frac{1}{\delta}\,E(T_0 + \cdots + T_{m-1}).$$

Hence

    $$\sigma^2(\tau_n^*) \le \frac{1}{\delta}\,E\tau_n^*.$$

If $\sum_1^\infty \tau_n^*$ converges a.s., but $\sum_1^\infty E\tau_n^* = \infty$,
then for $0 < \epsilon < 1$,

    $$P\Bigl(\Bigl|\sum_1^n (\tau_k^* - E\tau_k^*)\Bigr| \ge \epsilon \sum_1^n E\tau_k^*\Bigr) \to 1.$$

Applying Chebyshev's inequality to this probability gives a contradiction
which proves the proposition.


Problems


6. Show that $\alpha(k) = E\tau_k^*$ satisfies the difference equation

    $$\alpha(k) = p(k)\,\frac{1}{\lambda(k)} + q(k)\Bigl(\frac{1}{\lambda(k)} + \alpha(k - 1) + \alpha(k)\Bigr), \quad k \ge 1,$$

where

    $$p(k) = b(k)/\lambda(k), \qquad q(k) = d(k)/\lambda(k),$$

or

    $$p(k)\,\alpha(k) = \frac{1}{\lambda(k)} + q(k)\,\alpha(k - 1), \quad k \ge 1.$$

Deduce conditions on $p(k)$, $q(k)$, $\lambda(k)$ such that there are no explosions.
(See [87].) 

7. Discuss completely the explosive properties of a birth and death process 
with {0} a reflecting state and 


b(n) = be(n),  d(n) = dy(n). 


8. NONUNIQUENESS AND BOUNDARY CONDITIONS 


If explosions are possible, then $\bar{p}_t(F \mid x) < 1$ for some $x$, $t$, and
the process is not uniquely determined by the given infinitesimal conditions.
The nature of the nonuniqueness is that the particle can reach points on some
undefined "boundary" of $F$ not included in $F$. Then to completely describe
the process it is necessary to specify its evolution from these boundary
points. This is seen most graphically when $F$ is the integers. If
$P_j(R_\infty < \infty) > 0$ for some $j$, then we have to specify what the
particle will do once it reaches $\infty$. One possible procedure is to add to
$F$ a state denoted $\{\infty\}$ and to specify transition probabilities from
$\{\infty\}$ to $j \in F$. For example, we could make $\{\infty\}$ an absorbing
state, that is, $p_t(\{\infty\} \mid \{\infty\}) = 1$. An even more interesting
construction consists of specifying that once the particle reaches
$\{\infty\}$ it instantaneously moves into state $k$ with probability $Q(k)$.
This is more interesting in that it is not necessary to adjoin an extra state
$\{\infty\}$ to $F$. To carry out this construction, following Chung [16],
define

    $$p_t^{(0)}(j \mid k) = \bar{p}_t(j \mid k) = P_k(X(t) = j, R_\infty > t).$$

Now look at the probability $p_t^{(1)}(j \mid k)$ that $k \to j$ in time $t$ with
exactly one passage to $\{\infty\}$. To compute this, suppose that
$R_\infty = \tau$; then the particle moves immediately to state $l$ with
probability $Q(l)$, and then must go from $l$ to $j$ in time $t - \tau$ with no
further excursions to $\{\infty\}$. Hence, denoting
$H_k(d\tau) = P_k(R_\infty \in d\tau)$,

    $$p_t^{(1)}(j \mid k) = \int_0^t \Bigl[\sum_l p_{t-\tau}^{(0)}(j \mid l)\,Q(l)\Bigr] H_k(d\tau).$$

Similarly, the probability $p_t^{(n)}(j \mid k)$, of $k \to j$ in time $t$ with
exactly $n$ passages to $\{\infty\}$ is given by

    $$p_t^{(n)}(j \mid k) = \int_0^t \Bigl[\sum_l p_{t-\tau}^{(n-1)}(j \mid l)\,Q(l)\Bigr] H_k(d\tau).$$

Now define

    $$p_t(j \mid k) = \sum_{n=0}^{\infty} p_t^{(n)}(j \mid k).$$


Proposition 15.46. $p_t(j \mid k)$ as defined above satisfies

1) $\sum_j p_t(j \mid k) = 1$,

2) the Chapman-Kolmogorov equations, and

3) the backwards equations.


Proof. Left to reader.


Remark. $p_t(j \mid k)$ does not satisfy the forwards equations. See Chung
[16, pp. 224 ff.].


The process constructed the above way has the property that

    $$p_t(j \mid k) = p_t^{(0)}(j \mid k) + o(t).$$

This follows from noting that

    $$p_t(j \mid k) = p_t^{(0)}(j \mid k) + \int_0^t \Bigl(\sum_l p_{t-\tau}(j \mid l)\,Q(l)\Bigr) H_k(d\tau).$$

The integral term is dominated by $P_k(R_\infty \le t)$. This is certainly less
than the probability of two jumps in time $t$, hence is $o(t)$. Therefore, no
matter what $Q(l)$ is, all these processes have the specified infinitesimal
behavior. This leads to the observation (which will become more significant
in the next chapter), that if it is possible to reach a "boundary" point,
then boundary conditions must be added to the infinitesimal conditions in
order to specify the process.


9. RESOLVENT AND UNIQUENESS 


Although S with domain D(S) does not determine the process uniquely, 
this can be fixed up with a more careful and restrictive definition of the 
domain of S. In this section the processes dealt with will be assumed to 
have standard stationary transition probabilities, but no restrictions are put 
on their sample paths. 


Definition 15.47. Say that functions $\varphi_t(x)$ converge boundedly pointwise
to $\varphi(x)$ on some subset $A$ of their domain as $t \to 0$ if

i) $\lim_{t \to 0} \varphi_t(x) = \varphi(x)$, all $x \in A$,

ii) $\sup_{x \in A} |\varphi_t(x)| \le M < \infty$, for all $t$ sufficiently small.

Denote this by $\varphi_t(x) \xrightarrow{bp} \varphi(x)$ on $A$.
Let $\mathcal{L}$ be any class of bounded $\mathcal{B}_1(F)$-measurable functions.
Then we use

Definition 15.48. $D(S, \mathcal{L})$ consists of all functions $f(x)$ in
$\mathcal{L}$ such that

    $$\frac{E_x(f(X(t))) - f(x)}{t}$$

converges boundedly pointwise on $F$ to a function in $\mathcal{L}$.


We plan to show that with an appropriate choice of $\mathcal{L}$, corresponding
to a given $S$, $D(S, \mathcal{L})$, there is at most one process. In the course
of this, we will want to integrate functions of the type $E_x f(X(t))$, so we
need


Proposition 15.49. For $f(x)$ bounded and measurable on $F$,
$\varphi(x, t) = E_x f(X(t))$ is jointly measurable in $(x, t)$, with respect to
$\mathcal{B}_1(F) \times \mathcal{B}_1([0, \infty))$.


Proof. Take $f(x)$ bounded and continuous. Since $\{X(t)\}$ is continuous in
probability from the right, the function $\varphi(x, t) = E_x f(X(t))$ is
continuous in $t$ from the right and $\mathcal{B}_1(F)$-measurable in $x$ for $t$
fixed. Consider the approximation

    $$\varphi_n(x, t) = \sum_{k \ge 0} \chi_{[k/2^n,\,(k+1)/2^n)}(t)\,\varphi(x, (k + 1)/2^n).$$

By the right-continuity, $\varphi_n(x, t) \to \varphi(x, t)$. But
$\varphi_n(x, t)$ is jointly measurable, therefore so is $\varphi(x, t)$. Now
consider the class of $\mathcal{B}_1(F)$-measurable functions $f(x)$ such that
$|f(x)| \le 1$ and the corresponding $\varphi(x, t)$ is jointly measurable. This
class is closed under pointwise convergence, and contains all continuous
functions bounded by one. Hence it contains all measurable functions bounded
by one.


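Definition 15.50. The resolvent is defined as

    $$R_\lambda(B \mid x) = \int_0^\infty e^{-\lambda t}\,p_t(B \mid x)\,dt, \quad \lambda > 0,$$

for any $B \in \mathcal{B}_1(F)$, $x \in F$.


It is easy to check that $R_\lambda(B \mid x)$ is a bounded measure on
$\mathcal{B}_1(F)$ for fixed $x$. Furthermore, by 15.49 and the Fubini theorem,
$R_\lambda(B \mid x)$ is $\mathcal{B}_1(F)$-measurable in $x$ for
$B \in \mathcal{B}_1(F)$ fixed. Denote, for $f$ bounded and
$\mathcal{B}_1(F)$-measurable,

    $$(R_\lambda f)(x) = \int f(y)\,R_\lambda(dy \mid x), \quad\text{and}\quad (T_t f)(x) = \int f(y)\,p_t(dy \mid x).$$

Then, using the Fubini theorem to justify the interchange,

    $$R_\lambda f = \int_0^\infty e^{-\lambda t}\,T_t f\,dt.$$

Take $\mathcal{S}$ to be the set of all bounded $\mathcal{B}_1(F)$-measurable
functions $f(x)$ such that $(T_t f)(x) \to f(x)$ as $t \to 0$ for every
$x \in F$. Note that $D(S, \mathcal{S}) \subset \mathcal{S}$.


Theorem 15.51. If $f \in \mathcal{S}$, then $R_\lambda f$ is in $D(S, \mathcal{S})$ and

1) $(\lambda - S)(R_\lambda f) = f$.

If $f$ is in $D(S, \mathcal{S})$, then

2) $R_\lambda((\lambda - S)f) = f$.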

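Proof. If $f$ is in $\mathcal{S}$, then since $T_t$ and $R_\lambda$ commute, the
bounded convergence theorem can be applied to
$T_t(R_\lambda f) = R_\lambda(T_t f)$ to establish $R_\lambda f \in \mathcal{S}$.
Write

    $$T_t(R_\lambda f) = T_t \int_0^\infty e^{-\lambda\tau}(T_\tau f)\,d\tau = \int_0^\infty e^{-\lambda\tau}\,T_t(T_\tau f)\,d\tau.$$

The Chapman-Kolmogorov equations imply $T_t(T_\tau f) = T_{t+\tau} f$, so

    $$T_t(R_\lambda f) = e^{\lambda t}\Bigl[R_\lambda f - \int_0^t e^{-\lambda\tau}(T_\tau f)\,d\tau\Bigr].$$

From this, denoting

    $$S_t = \frac{T_t - I}{t},$$

we get

(15.52)    $$S_t(R_\lambda f) = -e^{\lambda t}\Bigl[\frac{e^{-\lambda t} - 1}{t}\,R_\lambda f
               + \frac{1}{t}\int_0^t e^{-\lambda\tau}(T_\tau f)\,d\tau\Bigr].$$

Using $\|f\| = \sup |f(x)|$, $x \in F$, and $\|R_\lambda f\| \le \|f\|/\lambda$,
we have

    $$\|S_t(R_\lambda f)\| \le e^{\lambda t}\Bigl(\frac{1 - e^{-\lambda t}}{\lambda t} + 1\Bigr)\|f\|, \quad 0 < t < 1.$$

As $t$ goes to zero, $(e^{-\lambda t} - 1)/t \to -\lambda$. As $\tau$ goes to zero,
$(T_\tau f)(x) \to f(x)$. Using these in (15.52) completes the proof of the
first assertion.
Now take $f$ in $D(S, \mathcal{S})$; by the bounded convergence theorem,

    $$R_\lambda((\lambda - S)f) = \lim_{t \to 0} R_\lambda((\lambda - S_t)f).$$

Note that $R_\lambda$ and $S_t$ commute, so

    $$R_\lambda((\lambda - S)f) = \lim_{t \to 0}\,(\lambda - S_t)(R_\lambda f).$$

By part (1) of this theorem, $R_\lambda f$ is in $D(S, \mathcal{S})$, hence

    $$\lim_{t \to 0}\,(\lambda - S_t)(R_\lambda f) = (\lambda - S)(R_\lambda f) = f.$$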
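
For a finite state space the two identities of Theorem 15.51 reduce to matrix algebra, which makes a quick sanity check possible. The sketch below assumes a hypothetical two-state chain with rates `a` and `b`, for which $S$ acts on functions as the rate matrix `Q` and the resolvent is $R_\lambda = (\lambda I - Q)^{-1}$ (a standard fact for finite chains, not derived in the text). It also compares $R_\lambda(\{0\} \mid 0)$ with the Laplace transform of the closed-form $p_t(0 \mid 0) = \frac{b}{a+b} + \frac{a}{a+b} e^{-(a+b)t}$.

```python
a, b = 2.0, 3.0                     # hypothetical exit rates of states 0 and 1
Q = [[-a, a], [b, -b]]              # (Sf)(j) = sum over k of Q[j][k] * f(k)

def resolvent(lam):
    """R_lambda = (lambda I - Q)^{-1}, by the 2 x 2 inverse formula."""
    m = [[lam - Q[0][0], -Q[0][1]], [-Q[1][0], lam - Q[1][1]]]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def apply_matrix(matrix, f):
    return [sum(matrix[j][k] * f[k] for k in range(2)) for j in range(2)]

def lam_minus_S(lam, f):
    """((lambda - S) f)(j) for a function f given as the list [f(0), f(1)]."""
    Qf = apply_matrix(Q, f)
    return [lam * f[j] - Qf[j] for j in range(2)]

lam = 0.8
f = [1.0, -2.5]                                          # an arbitrary test function
part_1 = lam_minus_S(lam, apply_matrix(resolvent(lam), f))   # (lambda - S) R f
part_2 = apply_matrix(resolvent(lam), lam_minus_S(lam, f))   # R (lambda - S) f

# Laplace transform of the closed-form p_t(0 | 0), integrated term by term.
laplace_p00 = b / ((a + b) * lam) + a / ((a + b) * (lam + a + b))
print(part_1, part_2, resolvent(lam)[0][0], laplace_p00)
```

Both `part_1` and `part_2` reproduce `f`, and the $(0, 0)$ entry of the resolvent agrees with the transform of the closed form.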
The purpose of this preparation is to prove 


Theorem 15.53. There is at most one set of standard transition probabilities
corresponding to given $S$, $D(S, \mathcal{S})$.


Proof. Suppose there are two different sets, $p_t^{(1)}$ and $p_t^{(2)}$ leading
to resolvents $R_\lambda^{(1)}$ and $R_\lambda^{(2)}$. For $f$ in $\mathcal{S}$, let

    $$g = (R_\lambda^{(1)} - R_\lambda^{(2)}) f.$$

Then, by 15.51(1),

    $$(\lambda - S)g = 0.$$

But $g \in D(S, \mathcal{S})$, so use 15.51(2) to get

    $$R_\lambda^{(1)}((\lambda - S)g) = g.$$

Therefore $g$ is zero. Thus for all $f \in \mathcal{S}$,
$R_\lambda^{(1)} f = R_\lambda^{(2)} f$. Since $\mathcal{S}$ includes all bounded
continuous functions, for any such function, and for all $\lambda > 0$,

    $$\int_0^\infty e^{-\lambda t}(T_t^{(1)} f)\,dt = \int_0^\infty e^{-\lambda t}(T_t^{(2)} f)\,dt.$$

By the uniqueness theorem for Laplace transforms (see [140], for example)
$(T_t^{(1)} f)(x) = (T_t^{(2)} f)(x)$ almost everywhere ($dt$). But both these
functions are continuous from the right, hence are identically equal. Since
bounded continuous functions separate,
$p_t^{(1)}(B \mid x) = p_t^{(2)}(B \mid x)$.


The difficulty with this result is in the determination of
$D(S, \mathcal{S})$. This is usually such a complicated procedure that the
uniqueness theorem 15.53 above has really only theoretical value. Some
examples follow in these problems.


Problems 


8. For the transition probabilities constructed and referred to in 15.46,
show that a necessary condition for $f(j)$ to be in $D(S, \mathcal{S})$ is


IID) = 0. 
9. Show that for any $\lambda', \lambda'' > 0$,

    $$R_{\lambda'} f = R_{\lambda''} f + (\lambda'' - \lambda')\,R_{\lambda'}(R_{\lambda''} f).$$

Use this to show that the set $\mathcal{R}_\lambda$ consisting of all functions
$\{R_\lambda f\}$, $f$ in $\mathcal{S}$, does not depend on $\lambda$. Use 15.51 to
show that $\mathcal{R}_\lambda = D(S, \mathcal{S})$.

10. For Brownian motion, show that the resolvent has a density
$r_\lambda(y \mid x)$ with respect to $dy$ given by

    $$r_\lambda(y \mid x) = \frac{1}{\sqrt{2\lambda}}\,e^{-\sqrt{2\lambda}\,|y - x|}.$$

11. Let $C$ be the class of all bounded continuous functions on $R^{(1)}$. Use
the identity in Problem 9, and the method of that problem to show that
$D(S, C)$ for Brownian motion consists of all functions $f$ in $C$ such that
$f''(x)$ is in $C$.

12. For a pure jump process, show that if $\sup \lambda(x) < \infty$, then
$D(S, \mathcal{S})$ consists of all bounded $\mathcal{B}_1(F)$-measurable
functions.


10. ASYMPTOTIC STATIONARITY 


Questions concerning the asymptotic stationarity of a Markov process 
{X(t)} can be formulated in the same way as for discrete time chains. In 
particular, 


Definition 15.54. $\pi(dx)$ on $\mathcal{B}_1(F)$ will be called a stationary
initial distribution for the process if for every $B \in \mathcal{B}_1(F)$ and
$t > 0$,

    $$\pi(B) = \int p_t(B \mid x)\,\pi(dx).$$

Now ask, when do the probabilities $p_t(B \mid x)$ converge as $t \to \infty$ to
some stationary distribution $\pi(B)$ for all $x \in F$? Interestingly enough,
the situation here is less complicated than in discrete time because there is
no periodic behavior. We illustrate this for $X(t)$ a pure jump process moving
on the integers. Define the times between successive returns to state $k$ by

    $$t_1^* = \inf\{t;\ X(t) = k,\ \exists\,\tau < t \text{ such that } X(\tau) \ne k\},$$
    $$t_2^* = \inf\{t;\ X(t + t_1^*) = k,\ \exists\,\tau,\ t_1^* < \tau < t + t_1^* \text{ such that } X(\tau) \ne k\},$$

and so forth. By the strong Markov property,


Proposition 15.55. If $P_k(t_1^* < \infty) = 1$, then the $t_1^*, t_2^*, \ldots$
are independent and identically distributed.


If $P_k(t_1^* < \infty) < 1$, then the state $k$ is called transient.
To analyze the asymptotic behavior of the transition probabilities, use

    $$\{X(t) = k\} = \bigcup_{n=0}^{\infty} \{t_1^* + \cdots + t_n^* \le t,\ t_1^* + \cdots + t_n^* + T_n(k) > t\},$$

where $T_n(k)$ is the duration of stay in state $k$ after the $n$th return. It is
independent of $t_1^*, \ldots, t_n^*$, and $P_k(T_n(k) > t) = e^{-\lambda(k)t}$. So

    $$p_t(k \mid k) = e^{-\lambda(k)t} + \sum_{n=1}^{\infty} P_k(t_1^* + \cdots + t_n^* \le t,\ t_1^* + \cdots + t_n^* + T_n(k) > t).$$

Put

    $$R(t) = \sum_{n=1}^{\infty} P_k(t_1^* + \cdots + t_n^* \le t);$$

then

    $$p_t(k \mid k) = e^{-\lambda(k)t} + \int_0^t P_k(T_0(k) > \xi)\,R(t - d\xi).$$

Argue that $t_1^*$ is nonlattice because

    $$t_1^* = T_0(k) + \tilde{t},$$

where $\tilde{t}$ is the time from the first exit to the first return. By the
strong Markov property, $T_0(k)$, $\tilde{t}$ are independent. Finally, note
that $T_0(k)$ has a distribution absolutely continuous with respect to Lebesgue
measure; hence so does $t_1^*$.
Now apply the renewal theorem 10.8. As $t \to \infty$,

    $$R(t - d\xi) \to \frac{d\xi}{E t_1^*}.$$

Conclude that

    $$\lim_t p_t(k \mid k) = \frac{1}{E t_1^*} \int_0^\infty e^{-\lambda(k)t}\,dt.$$

Hence

Proposition 15.56. Let $T$ be the first exit time from state $k$, $t_1^*$ the time
of first return. If $E t_1^* < \infty$, then

(15.57)    $$p_t(k \mid k) \to \frac{ET}{E t_1^*} = \pi(k).$$
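
The limit (15.57) can be illustrated on a two-state chain, where everything is explicit. The sketch below assumes hypothetical exit rates $\lambda(0) = a$ and $\lambda(1) = b$, so that for $k = 0$ one has $ET = 1/a$ and $E t_1^* = 1/a + 1/b$ (a stay in 0 followed by a stay in 1). It compares $ET / E t_1^*$ with the large-$t$ value of the closed-form $p_t(0 \mid 0)$ and with a seeded Monte Carlo estimate of the mean return time. Names and numbers are illustrative.

```python
import math
import random

a, b = 2.0, 3.0                          # hypothetical rates: lambda(0) = a, lambda(1) = b

def p00(t):
    """Closed-form p_t(0 | 0) for the two-state chain."""
    return b / (a + b) + (a / (a + b)) * math.exp(-(a + b) * t)

expected_stay = 1.0 / a                  # ET, the mean first exit time from state 0
expected_return = 1.0 / a + 1.0 / b      # E t_1^*: stay in 0, then stay in 1
pi_0 = expected_stay / expected_return   # right side of (15.57)

def mean_return_time(n_cycles, rng):
    """Monte Carlo estimate of E t_1^* from independent exit-and-return cycles."""
    total = 0.0
    for _ in range(n_cycles):
        total += rng.expovariate(a) + rng.expovariate(b)
    return total / n_cycles

estimate = mean_return_time(20000, random.Random(1))
print(pi_0, p00(20.0), estimate, expected_return)
```

Here `pi_0` equals $b/(a + b)$, the familiar stationary probability of state 0, and `p00(t)` approaches it exponentially fast.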


The following problems concern the rest of the problem of asymptotic 
convergence. 


Problems 


13. Let all the states communicate under the space structure given by
$p(j; k)$, and let the expected time of first return be finite for every
state. Show that

1) For any $k$, $j$,

    $$\lim_t p_t(j \mid k) = \pi(j),$$

where $\pi(j)$ is defined by (15.57).

2) If $\tilde{\pi}(k)$ is the stationary initial distribution under $p(j; k)$,
that is,

    $$\tilde{\pi}(j) = \sum_k p(j; k)\,\tilde{\pi}(k),$$

then

    $$\pi(j) = \alpha\,\tilde{\pi}(j)/\lambda(j),$$

where $\alpha$ is normalizing constant.

14. Show that if $k$ is a transient state, then $p_t(k \mid k)$ goes to zero
exponentially fast.

15. Show that $\pi$ is a stationary initial distribution for a process having
standard stationary transition probabilities if and only if

    $$\int g(y)\,\pi(dy) = 0$$

for all $g$ such that there exists an $f$ in $D(S, \mathcal{S})$ with $g = Sf$.


NOTES 


K. L. Chung’s book [16] is an excellent reference for the general structure 
of time-continuous Markov chains with a countable state space. Even with 
this simple a state space, the diversity of sample path behavior of processes 
with standard stationary transition probabilities is dazzling. For more 
general state spaces and for jump processes in particular, see Doob’s book 
[39, Chap. 6]. For a thoroughly modern point of view, including discussion 
of the strong Markov property and the properties of $S$, $D(S, \mathcal{S})$ and the
resolvent, see Dynkin [44, especially Vol. 1]. 

The fundamental work in this field started with Kolmogorov [93, 1931]. 
The problems concerning jump processes were treated analytically by 
Pospisil [117, 1935-1936] and Feller in 1936, but see Feller [52] for a fuller 
treatment. Doeblin [26, 1939] had an approach closer in spirit to ours. Doob
[33, 1942] carried on a more extended study of the sample path properties.
The usefulness of the resolvent and the systematic study of the domain of S
were introduced by Feller [57, 1952]. His idea was that the operators
$\{T_t\}$, $t \geq 0$, formed a semi-group, hence methods for analyzing semi-groups
of operators could be applied to get useful results. 

There is an enormous literature on applications of pure jump Markov 
processes, especially for those with a countable state space. For a look at 
some of those, check the books by Bharucha-Reid [3], T. E. Harris [67], 
N. T. J. Bailey [2], and T. L. Saaty [120]. An extensive reference to both 
theoretical and applied sources is the Bharucha-Reid book. 


CHAPTER 16 


DIFFUSIONS 


1. THE ORNSTEIN-UHLENBECK PROCESS 


In Chapter 12, the Brownian motion process was constructed as a model 
for a microscopic particle in liquid suspension. We found the outstanding 
nonreality of the model was the assumption that increments in displacement 
were independent—ignoring the effects of the velocity of the particle at the 
beginning of the incremental time period. We can do better in the following 
way: 
Let V(t) be the velocity of a particle of mass m suspended in liquid.
Let $\Delta V = V(t + \Delta t) - V(t)$, so that $m\,\Delta V$ is the change in momentum
of the particle during time $\Delta t$. The basic equation is


(16.1)    $$m\,\Delta V = -\beta V\,\Delta t + \Delta M.$$


Here $-\beta V$ is the viscous resistance force, so $-\beta V\,\Delta t$ is the loss in momentum
due to viscous forces during $\Delta t$. $\Delta M$ is the momentum transfer due to
molecular bombardment of the particle during time $\Delta t$.

Let M(t) be the momentum transfer up to time t. Normalize arbitrarily
to M(0) = 0. Assume that
i) $M(t + \Delta t) - M(t)$ is independent of $\mathcal{F}(M(\tau), \tau \leq t)$,
ii) the distribution of $\Delta M$ depends only on $\Delta t$,
iii) M(t) is continuous in t.


The third assumption may be questionable if one uses a hard billiard-ball 
model of molecules. But even in this case we reason that the jumps of M(t) 
would have to be quite small unless we allowed the molecules to have enor- 
mous velocities. At any rate (iii) is not unreasonable as an approximation. 

But (i), (ii), (iii) together characterize M(t) as a Brownian motion. The
presence of drift in M(t) would put a $\mu\,\Delta t$ term on the right-hand side of
(16.1). Such a term corresponds to a constant force field, and would be
useful, for example, in accounting for a gravity field. However, we will
assume no constant force field exists, and set $EM(t) = 0$. Put $EM^2(t) = \sigma^2 t$;
hence $M(t) = \sigma X(t)$, where X(t) is normalized Brownian motion. Equation
(16.1) becomes


(16.2)    $$m\,\Delta V = -\beta V\,\Delta t + \sigma\,\Delta X.$$
Doing what comes naturally, we divide by $\Delta t$, let $\Delta t \to 0$ and produce the
Langevin equation

(16.3)    $$m \frac{dV}{dt} = -\beta V + \sigma \frac{dX}{dt}.$$

The difficulty here is amusing: We know from Chapter 12 that $dX/dt$ exists
nowhere. So (16.3) makes no sense in any orthodox way. But look at this:
Write it as

$$\frac{d}{dt}\left(e^{\alpha t} V(t)\right) = \gamma e^{\alpha t} \frac{dX(t)}{dt},$$

where $\alpha = \beta/m$, $\gamma = \sigma/m$. Assume V(0) = 0 and integrate from 0 to t
to get

$$e^{\alpha t} V(t) = \gamma \int_0^t e^{\alpha \tau}\, dX(\tau).$$

Do an integration by parts on the integral,

$$e^{\alpha t} V(t) = \gamma e^{\alpha t} X(t) - \gamma \alpha \int_0^t X(\tau) e^{\alpha \tau}\, d\tau.$$


Now the integral appearing is for each $\omega$ just the integral of a continuous
function and makes sense. Thus the expression for V(t) given by

$$V(t) = \gamma \int_0^t e^{-\alpha(t - \tau)}\, dX(\tau)$$

can be well defined by this procedure, and results in a process with
continuous sample paths.
To get a more appealing derivation, go back to (16.2). Write it as

$$\Delta(e^{\alpha t} V) = \gamma e^{\alpha t}\, \Delta X + \delta(\Delta t),$$

where $\delta(\Delta t) = o(\Delta t)$ because by (16.2), V(t) is continuous and bounded
in every finite interval. By summing up, write this as

$$e^{\alpha t} V(t) \simeq \gamma \sum_{k=0}^{n-1} e^{\alpha t_k^{(n)}} \left(X(t_{k+1}^{(n)}) - X(t_k^{(n)})\right),$$

where $0 = t_0^{(n)} < \cdots < t_n^{(n)} = t$ is a partition $\mathcal{P}_n$ of [0, t]. If the limit of
the right-hand side exists in some decent way as $\|\mathcal{P}_n\| \to 0$, then it would be
very reasonable to define $e^{\alpha t} V(t)$ as this limit. Replace the integration by
parts in the integral by a similar device for the sum,

$$\sum_{k=0}^{n-1} e^{\alpha t_k} \left(X(t_{k+1}) - X(t_k)\right) = e^{\alpha t} X(t) - \sum_{k=0}^{n-1} \left(e^{\alpha t_{k+1}} - e^{\alpha t_k}\right) X(t_{k+1}).$$

The second sums are the Riemann-Stieltjes sums for the integral $\int_0^t X(\tau)\, d(e^{\alpha \tau})$.
For every sample path, they converge to the integral. Therefore:


Definition 16.4. The Ornstein-Uhlenbeck process V(t), normalized to be zero
at t = 0, is defined as

$$V(t) = \gamma \int_0^t e^{-\alpha(t - \tau)}\, dX(\tau),$$

where the integral is the limit of the approximating sums for every path.
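
The summation-by-parts step behind this definition can be checked numerically. The following is a minimal sketch, assuming a simulated normalized Brownian path on a uniform grid; the function names and all parameter values are illustrative and do not come from the text.

```python
import math
import random


def brownian_path(n_steps, t_end, seed=0):
    """Grid times and a simulated normalized Brownian path with X(0) = 0."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    times = [k * dt for k in range(n_steps + 1)]
    x = [0.0]
    for _ in range(n_steps):
        # Independent N(0, dt) increments.
        x.append(x[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return times, x


def ou_direct_sum(times, x, alpha, gamma):
    """gamma * e^{-alpha t} * sum_k e^{alpha t_k} (X(t_{k+1}) - X(t_k))."""
    t = times[-1]
    s = sum(math.exp(alpha * times[k]) * (x[k + 1] - x[k])
            for k in range(len(times) - 1))
    return gamma * math.exp(-alpha * t) * s


def ou_by_parts(times, x, alpha, gamma):
    """The same quantity after summation by parts (no increments of X)."""
    t = times[-1]
    s = math.exp(alpha * t) * x[-1] - sum(
        (math.exp(alpha * times[k + 1]) - math.exp(alpha * times[k])) * x[k + 1]
        for k in range(len(times) - 1))
    return gamma * math.exp(-alpha * t) * s
```

On any fixed path the two functions agree up to rounding, because the identity is purely algebraic; only the passage to the limit uses the continuity of the path.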
Proposition 16.5. V(t) is a Gaussian process with covariance

$$\Gamma(s, t) = EV(s)V(t) = \rho\left(e^{-\alpha|t - s|} - e^{-\alpha(t + s)}\right),$$

where $\rho = \gamma^2 / 2\alpha$.


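Proof. That V(t) is Gaussian follows from its being the limit of sums
$\sum_k \varphi(t_k)\, \Delta_k X$, where the $\Delta_k X = X(t_{k+1}) - X(t_k)$ are independent, normally
distributed random variables. To get $\Gamma(s, t)$, take s > t, put

$$0 = t_0 < t_1 < \cdots < t_n = t < t_{n+1} < \cdots < t_m = s.$$

Write

$$V(s) \simeq \gamma \sum_{k < m} e^{-\alpha(s - t_k)}\, \Delta_k X, \qquad V(t) \simeq \gamma \sum_{k < n} e^{-\alpha(t - t_k)}\, \Delta_k X.$$

Use $E(\Delta_k X)(\Delta_j X) = 0$ if $k \neq j$, $E(\Delta_k X)^2 = t_{k+1} - t_k$ to get

$$EV(s)V(t) \simeq \gamma^2 \sum_{k < n} e^{-\alpha(s + t)} e^{2\alpha t_k} (t_{k+1} - t_k).$$

Going to the limit, we get

$$EV(s)V(t) = \gamma^2 e^{-\alpha(s + t)} \int_0^t e^{2\alpha \tau}\, d\tau.$$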
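
As a sanity check on the last step, the closed form of Proposition 16.5 can be compared with a numerical evaluation of the limiting integral. This is only a sketch: the function names, the midpoint rule, and the parameter values are assumptions made for illustration.

```python
import math


def ou_cov_closed(s, t, alpha, gamma):
    """rho * (e^{-alpha|t-s|} - e^{-alpha(t+s)}) with rho = gamma^2 / (2 alpha)."""
    rho = gamma ** 2 / (2.0 * alpha)
    return rho * (math.exp(-alpha * abs(t - s)) - math.exp(-alpha * (t + s)))


def ou_cov_integral(s, t, alpha, gamma, n=20000):
    """gamma^2 e^{-alpha(s+t)} * int_0^{min(s,t)} e^{2 alpha tau} d tau (midpoint rule)."""
    upper = min(s, t)
    h = upper / n
    integral = h * sum(math.exp(2.0 * alpha * (k + 0.5) * h) for k in range(n))
    return gamma ** 2 * math.exp(-alpha * (s + t)) * integral
```

Setting s = t and letting t grow, `ou_cov_closed` approaches $\rho$, which is the limiting variance used in the next paragraph.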


As $t \to \infty$, $EV(t)^2 \to \rho$, so $V(t) \xrightarrow{\mathcal{D}} V(\infty)$, where $V(\infty)$ is $N(0, \rho)$. What
if we start the process with this limiting distribution? This would mean that
the integration of the Langevin equation would result in

$$e^{\alpha t} V_1(t) - V_1(0) = \gamma \int_0^t e^{\alpha \tau}\, dX(\tau).$$

Define the stationary Ornstein-Uhlenbeck process by

(16.6)    $$Y(t) = V(t) + e^{-\alpha t} V_1(0),$$

where $V_1(0)$ is $N(0, \rho)$ and independent of $\mathcal{F}(V(t), t \geq 0)$.


Proposition 16.7. Y(t) is a stationary Gaussian process with covariance
$$\Gamma(s, t) = \rho\, e^{-\alpha|s - t|}.$$


Proof. Direct computation. 
Remark. Stationarity has not been defined for continuous parameter
processes, but the obvious definition is that all finite-dimensional
distributions remain invariant under a time shift. For Gaussian processes with zero
means, stationarity is equivalent to $\Gamma(s, t) = \varphi(|s - t|)$.


The additional important properties of the Ornstein-Uhlenbeck process 
are: 


Proposition 16.8. Y(t) is a Markov process with stationary transition prob- 
abilities having all sample paths continuous. 


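Proof. Most of 16.8 follows from the fact that for $\tau > 0$, $Y(t + \tau) - e^{-\alpha t} Y(\tau)$
is independent of $\mathcal{F}(Y(s), s \leq \tau)$. To prove this, it is necessary
only to check the covariance

$$E\left[\left(Y(t + \tau) - e^{-\alpha t} Y(\tau)\right) Y(s)\right] = \Gamma(t + \tau, s) - e^{-\alpha t}\, \Gamma(\tau, s) = 0, \qquad s \leq \tau.$$

Now,

$$P(Y(t + \tau) \in A \mid Y(\tau) = x,\ Y(s), s \leq \tau)$$
$$= P(Y(t + \tau) - e^{-\alpha t} Y(\tau) \in A - e^{-\alpha t} x \mid Y(\tau) = x,\ Y(s), s \leq \tau)$$
$$= P(Y(t + \tau) - e^{-\alpha t} Y(\tau) \in A - e^{-\alpha t} x).$$

The random variable $Y(t + \tau) - e^{-\alpha t} Y(\tau)$ is normal with mean zero, and

$$E\left(Y(t + \tau) - e^{-\alpha t} Y(\tau)\right)^2 = E\left[\left(Y(t + \tau) - e^{-\alpha t} Y(\tau)\right) Y(t + \tau)\right] = \rho\left(1 - e^{-2\alpha t}\right).$$

Thus $p_t(\cdot \mid x)$ has the distribution

$$N\left(e^{-\alpha t} x,\ \rho(1 - e^{-2\alpha t})\right).$$


The continuity of paths follows from the definition of V(t) in terms of an
integral of X(t).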
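
The Gaussian transition law found in this proof gives a direct way to sample the stationary process on a time grid. The sketch below assumes that law, $N(e^{-\alpha t}x, \rho(1 - e^{-2\alpha t}))$ with $\rho = \gamma^2/2\alpha$; the function names and parameters are illustrative only.

```python
import math
import random


def ou_transition(x, t, alpha, gamma):
    """Mean and variance of Y(t + tau) given Y(tau) = x."""
    rho = gamma ** 2 / (2.0 * alpha)
    decay = math.exp(-alpha * t)
    return decay * x, rho * (1.0 - decay ** 2)


def sample_stationary_ou(n_steps, dt, alpha, gamma, seed=0):
    """Sample Y(0), Y(dt), ..., Y(n_steps * dt), starting from N(0, rho)."""
    rng = random.Random(seed)
    rho = gamma ** 2 / (2.0 * alpha)
    y = rng.gauss(0.0, math.sqrt(rho))
    path = [y]
    for _ in range(n_steps):
        mean, var = ou_transition(y, dt, alpha, gamma)
        y = rng.gauss(mean, math.sqrt(var))
        path.append(y)
    return path
```

Because the transition law is exact, the grid spacing `dt` introduces no discretization error; the law $N(0, \rho)$ is preserved at every step, since $e^{-2\alpha t}\rho + \rho(1 - e^{-2\alpha t}) = \rho$.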


Problems 


1. Show that if a process is Gaussian, stationary, Markov, and continuous
in probability, then it is of the form Y(t) + c, where Y(t) is an Ornstein-
Uhlenbeck process.

2. Let Z be a vector-valued random variable taking values in $R^{(m)}$, $m \geq 2$.
Suppose that the components of Z, $(Z_1, Z_2, \ldots, Z_m)$, are independent and
identically distributed with a symmetric distribution. Suppose also that
the components have the same property under all other orthogonal coordinate
systems gotten from the original one by rotation. Show that $Z_1, \ldots, Z_m$
are $N(0, \sigma^2)$.

Remark. The notable result of this problem is that any model for Brownian
motion in three dimensions leads to variables normally distributed providing
the components of displacement or velocity along the different axes are
independent and identically distributed (symmetry is not essential, see Kac
[79]) irrespective of which orthogonal coordinate system is selected.
However, it does not follow from this that the process must be Gaussian.


2. PROCESSES THAT ARE LOCALLY BROWNIAN 


In the spirit of the Langevin approach of the last section, if Y(t) is
Brownian motion with drift $\mu$, variance $\sigma^2$, then write


$$\Delta Y = \mu\,\Delta t + \sigma\,\Delta X.$$


The same integration procedure as before would then result in a process
Y(t) which would be, in fact, Brownian motion with parameters $\mu$, $\sigma$. To try
to get more general Markov processes with continuous paths, write


(16.9)    $$\Delta Y = \mu(Y)\,\Delta t + \sigma(Y)\,\Delta X.$$


As before X(t) is normalized Brownian motion. Y(t) should turn out to be
Markov with continuous paths and stationary transition probabilities. Argue
this way: $\mu(Y)\,\Delta t$ is a term approximately linear in t, but except for this term
$\Delta Y$ is of the order of $\Delta X$, hence Y(t) should be continuous. Further, assume
Y(t) is measurable $\mathcal{F}(X(\tau), \tau \leq t)$. Then the distribution of $\Delta Y$ depends
only on Y(t) (through $\mu(Y)$ and $\sigma(Y)$), and on $\Delta X$, which is independent of
$\mathcal{F}(Y(\tau), \tau \leq t)$ with distribution depending only on $\Delta t$.
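
Read literally, (16.9) suggests a step-by-step recipe for generating an approximate path. The following is a minimal sketch of such an Euler-type scheme; it is an illustration under that assumption, not the construction used later in the chapter, and the names `euler_path`, `mu`, and `sigma` are hypothetical.

```python
import math
import random


def euler_path(mu, sigma, y0, t_end, n_steps, seed=0):
    """Iterate Y <- Y + mu(Y) dt + sigma(Y) dX with independent dX ~ N(0, dt)."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    y = [y0]
    for _ in range(n_steps):
        dx = rng.gauss(0.0, math.sqrt(dt))  # increment of normalized Brownian motion
        y.append(y[-1] + mu(y[-1]) * dt + sigma(y[-1]) * dx)
    return y
```

For example, `euler_path(lambda y: -0.8 * y, lambda y: 1.3, 0.0, 5.0, 5000)` approximates an Ornstein-Uhlenbeck path with $\alpha = 0.8$, $\gamma = 1.3$; with `sigma` identically zero the scheme reduces to the ordinary Euler method for $dY/dt = \mu(Y)$.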

Roughly, a process satisfying (16.9) is locally Brownian. Given that
Y(t) = y, it behaves for the next short time interval as though it were a
Brownian motion with drift $\mu(y)$, variance $\sigma^2(y)$. Therefore, we can think of
constructing this kind of process by patching together various Brownian
motions. Note, assuming $\Delta X$ is independent of Y(t),

$$E(\Delta Y \mid Y(t) = y) \simeq \mu(y)\,\Delta t,$$
$$E((\Delta Y)^2 \mid Y(t) = y) \simeq \sigma^2(y)\, E(\Delta X)^2 + o(\Delta t) = \sigma^2(y)\,\Delta t + o(\Delta t).$$

Of course, the continuity condition is also satisfied,

$$P(|\Delta Y| \geq \epsilon \mid Y(t) = y) = o(\Delta t).$$

Define the truncated change in Y by

$$\Delta_\epsilon Y = \begin{cases} \Delta Y, & \text{if } |\Delta Y| < \epsilon, \\ 0, & \text{otherwise.} \end{cases}$$

As a first approximation to the subject matter of this chapter, I will say that
we are going to look at Markov processes Y(t) taking values in some interval
F, with stationary transition probabilities $p_t$ satisfying for every $\epsilon > 0$, and
y in the interior of F,

(16.10)
$$P(|\Delta Y| \geq \epsilon \mid Y(t) = y) = o(\Delta t),$$
$$E(\Delta_\epsilon Y \mid Y(t) = y) = \mu(y)\,\Delta t + o(\Delta t),$$
$$E((\Delta_\epsilon Y)^2 \mid Y(t) = y) = \sigma^2(y)\,\Delta t + o(\Delta t),$$

and having continuous sample paths. Conditions (16.10) are the infinitesimal
conditions for the process. A Taylor expansion gives


Proposition 16.11. Let f(x) be bounded with a continuous second derivative.
If $p_t(dy \mid x)$ satisfies (16.10), then (Sf)(x) exists for every point x in the interior
of F and equals

$$\frac{1}{2} \sigma^2(x) \frac{d^2 f(x)}{dx^2} + \mu(x) \frac{df(x)}{dx}.$$


Thus, 


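Proposition 16.12. If the transition probabilities satisfy (16.10), and have
densities $p_t(y \mid x)$ with a continuous second derivative for $x \in \mathrm{int}(F)$, then

(16.13)    $$\frac{\partial}{\partial t} p_t(y \mid x) = \frac{1}{2} \sigma^2(x) \frac{\partial^2}{\partial x^2} p_t(y \mid x) + \mu(x) \frac{\partial}{\partial x} p_t(y \mid x), \qquad x \in \mathrm{int}(F).$$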


Proof. This is the backwards equation. 


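Problem 3. Show by direct computation that the transition probabilities for
the Ornstein-Uhlenbeck process satisfy

$$\frac{\partial}{\partial t} p_t(y \mid x) = \frac{1}{2} \gamma^2 \frac{\partial^2}{\partial x^2} p_t(y \mid x) - \alpha x \frac{\partial}{\partial x} p_t(y \mid x).$$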


3. BROWNIAN MOTION WITH BOUNDARIES 


For X(t) a locally Brownian process as in the last section, the infini- 
tesimal operator S is defined for all interior points of F by 16.11. Of course, 
this completely defines S if F has only interior points. But if F has a closed 
boundary point, the definition of S at this point is not clear. This problem 
is connected with the question of what boundary conditions are needed 
to uniquely solve the backwards equation (16.13). To illuminate this problem 
a bit, we consider two examples of processes where F has a finite closed 
boundary point. 


Definition 16.14. Use $X_0(t)$ to denote normalized Brownian motion on $R^{(1)}$;
$p_t^{(0)}(dy \mid x)$ are its transition probabilities.


The examples will be concerned with Brownian motion restricted to the
interval $F = [0, \infty)$.
Example 1. Brownian motion with an absorbing boundary. Take $F = [0, \infty)$.
The Brownian motion X(t) starting from x > 0 with absorption at {0} is
defined by

$$X(t) = X_0(\min(t, t_0^*)),$$

where $X_0(t)$ is started from x.


It is not difficult to check that X(t) is Markov with stationary transition
probabilities. To compute these rigorously is tricky. Let $A \subset (0, \infty)$,
$A \in \mathcal{B}_1$, and consider $P_x(X_0(t) \in A,\ t_0^* < t)$. The set $\{X_0(t) \in A,\ t_0^* < t\}$
consists of all sample paths that pass out of $(0, \infty)$ at least once and then come
back in and get to A by time t. Let $A^L$ be the reflection of A around the
point x = 0. Argue that after hitting {0} at time $\tau < t$ it is just as probable
(by symmetry) to get to $A^L$ by time t as it is to get to A, implying


(16.15)    $$P_x(X_0(t) \in A,\ t_0^* < t) = P_x(X_0(t) \in A^L,\ t_0^* < t).$$


This can be proven rigorously by approximating $t_0^*$ by stopping times that
take only a countable number of values. We assume its validity. Proceed
by noting that $\{X_0(t) \in A^L\} \subset \{t_0^* < t\}$ so that

$$P_x(X_0(t) \in A^L,\ t_0^* < t) = P_x(X_0(t) \in A^L) = p_t^{(0)}(A^L \mid x).$$

Now,

$$P_x(X(t) \in A) = P_x(X_0(t) \in A,\ t_0^* > t)$$
$$= P_x(X_0(t) \in A) - P_x(X_0(t) \in A,\ t_0^* < t)$$
$$= p_t^{(0)}(A \mid x) - p_t^{(0)}(A^L \mid x).$$

The density for $p_t^{(0)}(A^L \mid x)$ is $p_t^{(0)}(-y \mid x)$. Thus


(16.16)    $$p_t^{(A)}(y \mid x) = p_t^{(0)}(y \mid x) - p_t^{(0)}(-y \mid x).$$


Example 2. Brownian motion with a reflecting boundary. Define the Brownian
motion X(t) on $F = [0, \infty)$ with a reflecting boundary at {0} to be


$$X(t) = |X_0(t)|,$$


where we start the motion from x > 0. What this definition does is to take
all parts of the $X_0(t)$ path below x = 0 and reflect them in the x = 0 axis
getting the X(t) path.

Proposition 16.17. X(t) is Markov with stationary transition probability
density

$$p_t^{(R)}(y \mid x) = p_t^{(0)}(y \mid x) + p_t^{(0)}(-y \mid x).$$
Proof. Take $A \in \mathcal{B}_1([0, \infty))$, x > 0. Consider the probabilities

$$P(X(t + \tau) \in A \mid X_0(t) = x,\ X_0(s), s \leq t),$$
$$P(X(t + \tau) \in A \mid X_0(t) = -x,\ X_0(s), s \leq t).$$

Because $X_0(t)$ is Markov, these reduce to

$$P(|X_0(t + \tau)| \in A \mid X_0(t) = x),$$
$$P(|X_0(t + \tau)| \in A \mid X_0(t) = -x).$$

These expressions are equal. Hence

$$P(X(t + \tau) \in A \mid X(t) = x,\ X(s), s \leq t)$$
$$= E\left(P(X(t + \tau) \in A \mid |X_0(t)| = x,\ X_0(s), s \leq t) \mid X(s), s \leq t\right)$$
$$= P(|X_0(t + \tau)| \in A \mid X_0(t) = x)$$
$$= p_\tau^{(0)}(A \mid x) + p_\tau^{(0)}(A^L \mid x).$$
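
The two boundary densities can be compared numerically. The sketch below assumes that normalized Brownian motion has the $N(x, t)$ transition density; the function names and the midpoint-rule helper are illustrative and not part of the text.

```python
import math


def p0(t, y, x):
    """Transition density of normalized Brownian motion (assumed N(x, t))."""
    return math.exp(-(y - x) ** 2 / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def p_absorbing(t, y, x):
    """Formula (16.16): absorption at 0."""
    return p0(t, y, x) - p0(t, -y, x)


def p_reflecting(t, y, x):
    """Proposition 16.17: reflection at 0."""
    return p0(t, y, x) + p0(t, -y, x)


def integrate(f, lo, hi, n=20000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))
```

Integrating over y in $(0, \infty)$, the reflecting density has total mass one, while the absorbing density has mass $2\Phi(x/\sqrt{t}) - 1$, with $\Phi$ the standard normal distribution function; the missing mass is the probability of absorption by time t. The absorbing density also vanishes identically at x = 0.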


In both examples, X(t) equals the Brownian motion $X_0(t)$ until the particle
reaches zero. Therefore, in both cases, for x > 0 and f bounded and
continuous on $[0, \infty)$,


$$|E_x f(X(t)) - E_x f(X_0(t))| = o(t).$$


As expected, in the interior of F,


$$(Sf)(x) = \frac{1}{2} \frac{d^2 f(x)}{dx^2}$$


for functions with continuous second derivatives. Assume that the limits
$f'(0+)$, $f''(0+)$, as $x \downarrow 0$, of the first and second derivatives exist. In the case
of a reflecting boundary at zero, direct computation gives


$$E_0 f(X(t)) - f(0) = \sqrt{\frac{2t}{\pi}}\, f'(0+) + \frac{t}{2} f''(0+) + o(t).$$


Thus, (Sf)(0) does not exist unless $f'(0+) = 0$. If $f'(0+) = 0$, then
$(Sf)(0) = \frac{1}{2} f''(0+)$, so not only is (Sf)(x) defined at x = 0, but it is also
continuous there.

If {0} is absorbing, then for any f, (Sf)(0) = 0. If we want (Sf)(x) to be
continuous at zero, we must add the restriction $f''(0+) = 0$.

Does the backwards equation (16.13) have the transition probabilities 
of the process as its unique solution? Even if we add the restriction that 
we will consider only solutions which are densities of transition probabilities 
of Markov processes, the examples above show that the solution is not 
unique. However, note that in the case of absorption

$$\lim_{x \downarrow 0} p_t^{(A)}(y \mid x) = 0$$

for all t, y > 0. Intuitively this makes sense, because the probability
starting from x of being absorbed at zero before hitting the point y goes to
one as $x \to 0$. For reflection, use the symmetry to verify that

$$\lim_{x \downarrow 0} \frac{\partial}{\partial x} p_t^{(R)}(y \mid x) = 0.$$

If either of the above boundary conditions is imposed on the backwards
equation, it is possible to show that there is a unique solution which is a set
of transition probability densities.

Reflection or absorption is not the only type of behavior possible at 
boundary points. Odd things can occur, and it was the occurrence of some 
of these eccentricities which first prompted Feller’s investigation [56] and 
eventually led to a complete classification of boundary behavior. 


Problems 


4. Show that the process X(t) defined on [0, 1] by folding over Brownian
motion,

$$X(t) = |X_0(t) - 2n| \quad \text{if } |X_0(t) - 2n| \leq 1,$$

is a Markov process with stationary transition probabilities such that

$$\frac{\partial}{\partial x} p_t(y \mid x)\Big|_{x = 0+} = \frac{\partial}{\partial x} p_t(y \mid x)\Big|_{x = 1-} = 0.$$

Evaluate (Sf)(x) for $x \in (0, 1)$. For what functions does (Sf)(x) exist at the
endpoints? [This process is called Brownian motion with two reflecting
boundaries.]


5. For Brownian motion on $[0, \infty)$, either absorbing or reflecting, evaluate
the density $r_\lambda(y \mid x)$ of the resolvent $R_\lambda(dy \mid x)$. For C the class of all bounded
continuous functions on $[0, \infty)$, show that


a) For absorbing Brownian motion,

$$D(S, C) = \{f \in C;\ f'' \in C,\ f''(0+) = 0\}.$$

b) For reflecting Brownian motion,

$$D(S, C) = \{f \in C;\ f'' \in C,\ f'(0+) = 0\}.$$


[See Problems 10 and 11, Chapter 15. Note that in (a), $R_\lambda(dy \mid 0)$ assigns all
of its mass to {0}.]


4. FELLER PROCESSES


The previous definitions and examples raise a host of interesting and difficult 
questions. For example: 


1) Given the form of S, that is, given $\sigma^2(x)$ and $\mu(x)$ defined on int(F), and
certain boundary conditions, does there exist a unique Markov process with
continuous paths having S as its infinitesimal operator and exhibiting the
desired boundary behavior?


2) If the answer to (1) is Yes, do the transition probabilities have a density
$p_t(y \mid x)$? Are these densities smooth enough in x so that they are a solution
of the backwards equations? Do the backwards equations have a unique
solution?


The approach that will be followed is similar to that of the previous chapter. 
Investigate first the properties of the class of processes we want to construct— 
try to simplify their structure as far as possible. Then work backwards from 
the given infinitesimal conditions to a process of the desired type. 

To begin, assume the following. 


Assumption 16.18(a). F is an interval closed, open, or half of each, finite or 
infinite. X(t) is a Markov process with stationary transition probabilities such 
that starting from any point x € F, all sample paths are continuous. 


The next step would be to prove that 16.18(a) implies that the strong Markov
property holds. This is not true. Consider the following counterexample:
Let $F = R^{(1)}$. Starting from any $x \neq 0$, X(t) is Brownian motion starting
from that point. Starting from x = 0, $X(t) \equiv 0$. Then, for $x \neq 0$,


$$P_x(X(t + t_0^*) \in B) = P_0(X_0(t) \in B).$$

But if X(t) were strong Markov, since $t_0^*$ is a Markov time,


$$P_x(X(t + t_0^*) \in B) = P_{X(t_0^*)}(X(t) \in B) = \chi_B(0).$$

The pathology in this example is that starting from the point x = 0
gives distributions drastically different from those obtained by starting from
any point $x \neq 0$. When you start going through the proof of the strong
Markov property, you find that it is exactly this large change in the
distribution of the process when the initial conditions are changed only slightly
that needs to be avoided. This recalls the concept of stability introduced in
Chapter 8.


Definition 16.19. Call the transition probabilities stable if for any sequence of
initial distributions $\pi_n(\cdot) \xrightarrow{\mathcal{D}} \pi(\cdot)$, the corresponding probabilities satisfy


$$P_{\pi_n}(X(t) \in \cdot) \xrightarrow{\mathcal{D}} P_\pi(X(t) \in \cdot)$$


for any t > 0. Equivalently, for all $\varphi(x)$ continuous and bounded on F,
$E_x \varphi(X(t))$ is continuous on F.
Assume, in addition to 16.18(a),
Assumption 16.18(b). The transition probabilities of X(t) are stable.


Definition 16.20. A process satisfying 16.18(a),(b) will be called a Feller 
process. 


Now we can carry on. 
Theorem 16.21. A Feller process has the strong Markov property. 


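Proof. Let $\varphi(x)$ be bounded and continuous on F, then $\tilde{\varphi}(x) = E_x \varphi(X(t))$
is likewise. Let $t^*$ be a Markov time, $t_n^*$ a sequence of Markov times such
that $t_n^* \downarrow t^*$ and $t_n^*$ takes on only a countable number of values. From
(15.20), for $C \in \mathcal{F}(X(\tau), \tau \leq t^*)$,


$$\int_C \varphi(X(t + t_n^*))\, dP_x = \int_C \tilde{\varphi}(X(t_n^*))\, dP_x.$$

The path continuity and the continuity of $\varphi$, $\tilde{\varphi}$ give

$$\varphi(X(t + t_n^*)) \to \varphi(X(t + t^*)),$$
$$\tilde{\varphi}(X(t_n^*)) \to \tilde{\varphi}(X(t^*)).$$

Thus, for $\varphi$ continuous,

$$E_x(\varphi(X(t + t^*)) \mid X(\tau), \tau \leq t^*) = E_{X(t^*)} \varphi(X(t)).$$

The continuous functions separate, thus, for any $B \in \mathcal{B}_1(F)$,

$$P_x(X(t + t^*) \in B \mid X(\tau), \tau \leq t^*) = P_{X(t^*)}(X(t) \in B).$$


To extend this, let $\varphi(x_1, \ldots, x_k)$ on $F^{(k)}$ equal a product $\varphi_1(x_1) \cdots \varphi_k(x_k)$,
where $\varphi_1, \ldots, \varphi_k$ are bounded and continuous on F. It is easy to check that


$$E_x \varphi(X(t_1), \ldots, X(t_k)) = \tilde{\varphi}(x)$$

is continuous in x. By the same methods as in (15.20), get

$$\int_C \varphi(X(t_1 + t_n^*), \ldots, X(t_k + t_n^*))\, dP_x = \int_C \tilde{\varphi}(X(t_n^*))\, dP_x,$$

conclude that

$$\int_C \varphi(X(t_1 + t^*), \ldots, X(t_k + t^*))\, dP_x = \int_C \tilde{\varphi}(X(t^*))\, dP_x,$$

and now use the fact that products of bounded continuous functions separate
probabilities on $\mathcal{B}_k(F)$.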


Of course, this raises the question of how much stronger a restriction 
stability is than the strong Markov property. The answer is—Not much! 
To go the other way, it is necessary that the state space have something like
an indecomposability property—that every point of F can be reached from 
every interior point. 


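Definition 16.22. The process is called regular if for every $x \in \mathrm{int}(F)$ and $y \in F$,


$$P_x(t_y^* < \infty) > 0.$$

Theorem 16.23. If X(t) is regular and strong Markov, then its transition
probabilities are stable.

Proof. Let x < y < z, $x \in \mathrm{int}(F)$. Define $t_y = \min(t_y^*, s)$, $t_z = \min(t_z^*, s)$,
for s > 0 such that $P_x(t_z^* < s) > 0$. These are Markov times. Take $\varphi(x)$
bounded and continuous, $\tilde{\varphi}(x) = E_x \varphi(X(t))$. By the strong Markov property,


$$E_x \varphi(X(t + t_y)) = E_x\left[E_x(\varphi(X(t + t_y)) \mid X(\tau), \tau \leq t_y)\right] = E_x \tilde{\varphi}(X(t_y)),$$
$$E_x \varphi(X(t + t_z)) = E_x \tilde{\varphi}(X(t_z)).$$

Suppose that on the set $\{t_z^* < \infty\}$, $t_y^* \uparrow t_z^*$ a.s. $P_x$ as $y \uparrow z$, implying $t_y \uparrow t_z$
a.s. So $E_x \varphi(X(t + t_y)) \to E_x \varphi(X(t + t_z))$. The right-hand sides of the
above equations are


$$\tilde{\varphi}(y) P_x(t_y^* < s) + \int_{\{t_y^* \geq s\}} \tilde{\varphi}(X(s))\, dP_x$$

and

$$\tilde{\varphi}(z) P_x(t_z^* < s) + \int_{\{t_z^* \geq s\}} \tilde{\varphi}(X(s))\, dP_x.$$


For s a continuity point of $P_x(t_z^* < s)$, $P_x(t_y^* < s) \to P_x(t_z^* < s)$, and the
sets $\{t_y^* \geq s\} \downarrow \{t_z^* \geq s\}$. The conclusion is that $\tilde{\varphi}(y) \to \tilde{\varphi}(z)$.

The final part is to get $t_y^* \uparrow t_z^*$ as $y \uparrow z$. Let $t_z^*(\omega) < \infty$ and denote
$t^* = \lim t_y^*$, as $y \uparrow z$. On the set $\{t_z^* < \infty\}$, $X(t_y^*) = y$. By the path
continuity, $X(t^*) = z$, so $t^* = t_z^*$. By varying x to the left and right of points,
16.23 results.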


From here on, we work only with regular processes. 


Problem 6. Let C be the class of all bounded continuous functions on F.
For a Feller process show that if $f \in C$, then $R_\lambda f$ is in C. Deduce that
$D(S, C) = D(S, \mathcal{S})$. Prove that there is at most one Feller process
corresponding to a given S, D(S, C). (See Theorem 15.53.)


5. THE NATURAL SCALE 


For pure jump processes, the structure was decomposable into a space 
structure, governed by a discrete time Markov chain and a time rate which 
determined how fast the particle moved through the paths of the space 
structure. Regular Feller processes can be decomposed in a very similar way.
The idea is clearer when it is stated a bit more generally. Look at a
path-continuous Markov process with stable stationary transition probabilities
taking values in n-dimensional space, $X(t) \in R^{(n)}$. Consider a set $B \in \mathcal{B}_n$;
let $t^*(B)$ be the first exit time from B, then using the nontrivial fact that
$t^*(B)$ is measurable, define probabilities on the Borel subsets of the boundary
of B by
$$Q_x(C) = P_x(X(t^*(B)) \in C), \qquad C \subset \mathrm{bd}(B).$$


The Q,(C) are called exit distributions and specify the location of X(t) upon 
first leaving B. Suppose two such processes have the same exit distributions 
for all open sets. Then we can prove that under very general conditions 
they differ from each other only by a random change of time scale [10]. 
Thus the exit distributions characterize the space structure of the process. 
To have the exit distributions make sense, it is convenient to know that the 
particle does exit a.s. from the set in question. Actually, we want and get 
much more than this. 


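Theorem 16.24. Let J be a finite open interval such that $\bar{J} \subset F$. Then

$$\sup_{x \in J} E_x t^*(J) < \infty.$$

Proof. First we need

Lemma 16.25. Let $\sup_{x \in J} P_x(t^*(J) > t) = \alpha < 1$ for some t > 0. Then

$$\sup_{x \in J} E_x t^*(J) \leq \frac{t}{1 - \alpha}.$$


Proof. Let $\alpha_n = \sup_{x \in J} P_x(t^*(J) > nt)$. Write

$$P_x(t^*(J) > (n + 1)t) \leq \alpha_n P_x(t^*(J) > (n + 1)t \mid t^*(J) > nt).$$

Let $t_n^*(J)$ be the first exit time from J, starting from time nt. Then

$$P_x(t^*(J) > (n + 1)t) \leq \alpha_n P_x(t_n^*(J) > t \mid t^*(J) > nt).$$

Since $\{t^*(J) > nt\} \in \mathcal{F}(X(\tau), \tau \leq nt)$, then

$$P_x(t^*(J) > (n + 1)t) \leq \alpha_n \alpha.$$

Hence $\alpha_n \leq \alpha^n$, and

$$E_x t^*(J) = \int_0^\infty P_x(t^*(J) > \tau)\, d\tau = \sum_{n=0}^{\infty} \int_{nt}^{(n+1)t} P_x(t^*(J) > \tau)\, d\tau \leq \sum_{n=0}^{\infty} \alpha^n t = \frac{t}{1 - \alpha}.$$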
To finish the proof of 16.24, let J = (a, b), pick $y \in J$. By regularity,
there exists a t and $\alpha$ such that $P_y(t_a^* > t) \leq \alpha < 1$, $P_y(t_b^* > t) \leq \alpha < 1$.
For $y \leq x < b$,

$$P_x(t^*(J) > t) \leq P_x(t_b^* > t) \leq P_y(t_b^* > t),$$

and for $a < x \leq y$,

$$P_x(t^*(J) > t) \leq P_x(t_a^* > t) \leq P_y(t_a^* > t).$$

Apply the lemma now to deduce the theorem.


Remark. Note that the lemma has wider applicability than the theorem. 
Actually, it holds for all intervals J, finite or infinite. 


In one dimension the relevant exit probabilities are: 


Definition 16.26. For any open interval J = (a, b) such that $P_x(t^*(J) < \infty) = 1$,
$x \in J$, define

$$p^+(x, J) = P_x(X(t^*(J)) = b),$$

$$p^-(x, J) = 1 - p^+(x, J).$$


Theorem 16.27. There exists a continuous, strictly increasing function u(x) on
F, unique up to a linear transformation, such that for $\bar{J} \subset F$, J = (a, b),


(16.28)    $$p^+(x, J) = \frac{u(x) - u(a)}{u(b) - u(a)}.$$


Proof. For $J \subset I$, note that exiting right from I starting from $x \in J$ can
occur in two ways:

i) Exit right from J starting from x, then exit right from I starting from b.
ii) Exit left from J starting from x, then exit right from I starting from a.


Use the strong Markov property to conclude that for $x \in J$,


(16.29)    $$p^+(x, I) = p^+(x, J)\, p^+(b, I) + p^-(x, J)\, p^+(a, I)$$

or

$$p^+(x, J) = \frac{p^+(x, I) - p^+(a, I)}{p^+(b, I) - p^+(a, I)}.$$

If F were bounded and closed, then we could take $u(x) = p^+(x, \mathrm{int}(F))$
and satisfy the theorem. In general, we have to use extension. Take $I_0$ to
be a bounded open interval, such that if F includes any one of its endpoints,
then $\bar{I}_0$ includes that endpoint. Otherwise $I_0$ is arbitrary. Define u(x) on
$I_0 = (x_1, x_2)$ as $p^+(x, I_0)$. By the equation above, for $I_0 \subset I_1$, $x \in I_0$,


(16.30)    $$u(x) = \frac{p^+(x, I_1) - p^+(x_1, I_1)}{p^+(x_2, I_1) - p^+(x_1, I_1)}.$$
Define an extension $u^{(1)}(x)$ on $I_1$ by the right-hand side of (16.30). Suppose
another interval $I_2$ is used to get an extension, say $I_1 \subset I_2$. Then for $x \in I_2$,
we would have


$$u^{(2)}(x) = \frac{p^+(x, I_2) - p^+(x_1, I_2)}{p^+(x_2, I_2) - p^+(x_1, I_2)}.$$

For $x \in I_1$, (16.29) gives


$$p^+(x, I_2) = c_1\, p^+(x, I_1) + c_2.$$


Substitute this into (16.30) to conclude that $u^{(2)}(x) = u^{(1)}(x)$ on $I_1$. Thus
the extensions are unique. Continuing this way, we can define u(x) on int(F)
so that (16.28) is satisfied.

It is increasing; otherwise there exists a finite open J, $\bar{J} \subset F$, and $x \in J$,
such that $p^+(x, J) = 0$. This contradicts regularity. Extend u, by taking
limits, to endpoints of F included in F. Now let $J_n$ be open, $J_n \uparrow J = (a, b)$.
I assert that $t^*(J_n) \uparrow t^*(J)$. Because $t^*(J_n) \leq t^*(J)$, by monotonicity
$t^* = \lim_n t^*(J_n)$ exists, and by continuity $X(t^*) = a$ or b. For any $\epsilon > 0$ and n
sufficiently large,

$$p^+(x, J_n) = P_x(X(t^*(J_n)) > b - \epsilon).$$


Since $X(t^*(J_n)) \to X(t^*(J))$ a.s., taking limits of the above equation gives

$$p^+(x, J_n) \to p^+(x, J).$$


By taking either $a_n \downarrow a$ or $b_n \uparrow b$ we can establish the continuity. The fact
that u(x) is unique up to a linear transformation follows from (16.28).


We will say that a process is on its natural scale if it has the same exit 
distributions as Brownian motion. From (13.3), 


Definition 16.31. A process is said to be on its natural scale if for every
J = (a, b), $\bar{J} \subset F$,

$$p^+(x, J) = \frac{x - a}{b - a},$$

that is, if u(x) = x satisfies (16.28).


The distinguishing feature of the space structure of normalized Brownian
motion is that it is driftless. There is as much tendency to move to the right
as to the left. More formally, if J is any finite interval and $x_0$ its midpoint,
then for normalized motion, by symmetry, $p^+(x_0, J) = \frac{1}{2}$. We generalize
this to


Proposition 16.32. A process is on its natural scale if and only if for every
finite open J, $\bar{J} \subset F$, $x_0$ the midpoint of J,


$$p^+(x_0, J) = \frac{1}{2}.$$
Proof. Consider n points equally spaced in J,

$$a = x_0 < x_1 < \cdots < x_n = b.$$

Starting from $x_k$, the particle next hits $x_{k-1}$ or $x_{k+1}$ with equal
probability, so $p^+(x_k, (x_{k-1}, x_{k+1})) = \frac{1}{2}$. Therefore, the particle behaves like a
symmetric random walk on the points of the partition. From Chapter 7,
Section 10,

$$p^+(x_k, J) = \frac{x_k - a}{b - a}.$$

The continuity established in Theorem 16.27 completes the proof of 16.32.
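
The random-walk step in this proof can be checked by solving the hitting-probability equations exactly. The sketch below is illustrative: it assumes the sites are indexed 0, ..., n, solves $h(k) = \frac{1}{2}(h(k-1) + h(k+1))$ with $h(0) = 0$, $h(n) = 1$ by tridiagonal elimination in rational arithmetic, and the function name is hypothetical.

```python
from fractions import Fraction


def exit_right_probabilities(n):
    """Probability that a symmetric walk on 0..n reaches n before 0, per start site."""
    half = Fraction(1, 2)
    m = n - 1  # number of interior sites 1..n-1
    # Row i of the system: -h(i-1)/2 + h(i) - h(i+1)/2 = rhs[i];
    # the boundary value h(n) = 1 contributes 1/2 to the last row.
    diag = [Fraction(1)] * m
    rhs = [Fraction(0)] * m
    rhs[-1] = half
    for i in range(1, m):  # forward elimination
        w = -half / diag[i - 1]
        diag[i] -= w * (-half)
        rhs[i] -= w * rhs[i - 1]
    h = [Fraction(0)] * m
    h[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):  # back substitution
        h[i] = (rhs[i] + half * h[i + 1]) / diag[i]
    return [Fraction(0)] + h + [Fraction(1)]
```

The solution is k/n at site k, which is $(x_k - a)/(b - a)$ for equally spaced points.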


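Let the state space undergo the transformation $\tilde{x} = u(x)$. Equivalently,
consider the process


(16.33)    $$\tilde{X}(t) = u(X(t)).$$


If X(t) is a regular Feller process, then so is $\tilde{X}(t)$. The importance of this
transformation is:


Proposition 16.34. $\tilde{X}(t)$ is on its natural scale.


Proof. Let $\tilde{J} = (\tilde{a}, \tilde{b})$, $\tilde{a} = u(a)$, $\tilde{b} = u(b)$, J = (a, b). For the $\tilde{X}(t)$ process,
with $\tilde{x} = u(x)$,


$$\tilde{p}^+(\tilde{x}, \tilde{J}) = P_{\tilde{x}}(\tilde{X}(t^*(\tilde{J})) = \tilde{b}) = P_x(X(t^*(J)) = b) = \frac{u(x) - u(a)}{u(b) - u(a)} = \frac{\tilde{x} - \tilde{a}}{\tilde{b} - \tilde{a}}.$$
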
For any regular Feller process then, a simple space transformation gives 
another regular Feller process having the same space structure as normalized 
Brownian motion. Therefore, we restrict attention henceforth to this type 
and examine the time flow. 


Remark. The reduction to natural scale derived here by using the
transformation $\tilde{x} = u(x)$ does not generalize to Feller processes in two or more
dimensions. Unfortunately, then, a good deal of the theory that follows
just does not generalize to higher dimensions.


6. SPEED MEASURE 


The functions $m(x, J) = E_x t^*(J)$ for open intervals J determine how
fast the process moves through its paths. There is a measure m(dx) closely
associated with these functions. Define, for J = (a, b) finite,


(16.35)    $$G_J(x, y) = \begin{cases} \dfrac{2(x - a)(b - y)}{b - a}, & x, y \in J,\ x \leq y, \\[1ex] \dfrac{2(y - a)(b - x)}{b - a}, & x, y \in J,\ x \geq y, \\[1ex] 0, & \text{otherwise.} \end{cases}$$

Then, 


Theorem 16.36. Let X(t) be on its natural scale. Then there is a unique measure
m(dx) defined on $\mathcal{B}_1(\mathrm{int}\, F)$, $m(B) < \infty$ for B bounded, $\bar{B} \subset \mathrm{int}(F)$, such that
for finite open J, $\bar{J} \subset F$,


$$m(x, J) = \int G_J(x, y)\, m(dy).$$


Proof. The proof will provide some justification for the following 
terminology. 


Definition 16.37. m(dx) is called the speed measure for the process. 


Consider $(a, b)$ partitioned by points $a = x_0 < x_1 < \cdots < x_n = b$, where
$x_k = a + k\delta$. Define $J_k = (x_{k-1}, x_{k+1})$. Note that $m(x_k, J_k)$ gives some
quantitative indication of how fast the particle is traveling in the vicinity of
$x_k$. Consider the process only at the exit times from one of the $J_k$ intervals.
This is a symmetric random walk $Z_0, Z_1, \ldots$ moving on the points $x_0, \ldots, x_n$.
Let $n(x_j; x_k)$ be the expected number of times the random walk visits the
point $x_k$, starting from $x_j$, before hitting $x_0$ or $x_n$. Then

$$(16.38)\qquad m(x_j, J) = \sum_{k=1}^{n-1} n(x_j; x_k)\, m(x_k, J_k).$$


This formula is not immediately obvious, because $t^*(J)$ and $X(t^*(J))$ are not,
in general, independent. Use this argument: Let $t^*_N$ be the time taken for
the transition $Z_N \to Z_{N+1}$. For $x \in \{x_0, \ldots, x_n\}$,

$$E_x t^*_N = E_x\big(E(t^*_N \mid Z_N)\big) = \sum_{k=1}^{n-1} P_x(Z_N = x_k)\, m(x_k, J_k).$$

Sum over $N$, noting that

$$n(x_j; x_k) = E_{x_j}\Big(\sum_{N=0}^{\infty} \chi_{\{x_k\}}(Z_N)\Big) = \sum_{N=0}^{\infty} P_{x_j}(Z_N = x_k).$$

This function was evaluated in Problem 14, Chapter 7, with the result,

$$n(x_j; x_k) = \frac{1}{\delta}\, G_J(x_j, x_k).$$
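
The identity $n(x_j; x_k) = G_J(x_j, x_k)/\delta$ can be checked numerically. The following is a minimal sketch, not part of the text's argument: it assumes $J = (0, 1)$ with $n = 8$ equal steps, and the helper names `green` and `expected_visits` are illustrative. The expected visit counts of the killed symmetric walk are obtained as the matrix $(I - Q)^{-1}$, where $Q$ is the transition matrix restricted to the interior points.

```python
import numpy as np

def green(x, y, a, b):
    """G_J(x, y) of (16.35) for J = (a, b); zero outside J."""
    if not (a < x < b and a < y < b):
        return 0.0
    lo, hi = min(x, y), max(x, y)
    return 2.0 * (lo - a) * (b - hi) / (b - a)

def expected_visits(n):
    """Expected visits n(x_j; x_k) of the symmetric walk on x_0..x_n, stopped at x_0 and x_n.

    Returns the (n-1) x (n-1) matrix (I - Q)^{-1} over the interior points,
    counting the visit at time 0.
    """
    q = np.zeros((n - 1, n - 1))
    for i in range(n - 1):
        if i > 0:
            q[i, i - 1] = 0.5
        if i < n - 2:
            q[i, i + 1] = 0.5
    return np.linalg.inv(np.eye(n - 1) - q)

a, b, n = 0.0, 1.0, 8            # assumed interval and partition size
delta = (b - a) / n
grid = [a + k * delta for k in range(1, n)]
visits = expected_visits(n)
green_over_delta = np.array(
    [[green(xj, xk, a, b) / delta for xk in grid] for xj in grid]
)
max_gap = float(np.max(np.abs(visits - green_over_delta)))
```

Here `max_gap` is the largest entrywise difference between the two matrices; it should be at the level of floating-point rounding.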




Defining $m_\delta(dx)$ as a measure that gives mass $m(x_k, J_k)/\delta$ to the point $x_k$,
we get

$$m(x_j, J) = \int G_J(x_j, y)\, m_\delta(dy).$$


Now to get $m(dx)$ defined on all $\operatorname{int}(F)$. Partition $F$ by successive refinements
$\mathcal{T}^{(n)}$ having points a distance $\delta_n$ apart, with $\delta_n \to 0$. Define the measure $m_n$
as assigning mass

$$m(x_k, (x_{k-1}, x_{k+1}))/\delta_n$$

to all points $x_k \in \mathcal{T}^{(n)}$ which are not endpoints of the partition. For $a, b,
x_j \in \mathcal{T}^{(n)}$, $J = (a, b)$, $\bar{J} \subset F$,

$$(16.39)\qquad m(x_j, J) = \int G_J(x_j, y)\, m_n(dy).$$


For any finite interval $I$ such that $\bar{I} \subset \operatorname{int}(F)$, (16.39) implies that
$\limsup_n m_n(I) < \infty$. Use this fact to conclude that there exists a subsequence
$m_{n'} \xrightarrow{w} m$, where $m$ is a measure on $\mathcal{B}_1(\operatorname{int} F)$ (see 10.5). Furthermore, for any
$J = (a, b)$, and $x \in J$, where $a, b, x$ are in $\bigcup_n \mathcal{T}^{(n)}$,

$$(16.40)\qquad m(x, J) = \int G_J(x, y)\, m(dy).$$


For any arbitrary finite open interval $J$, $\bar{J} \subset F$, take $J_n \subset J$ where $J_n$ has
endpoints in $\bigcup_n \mathcal{T}^{(n)}$ and pass to the limit as $J_n \uparrow J$, to get (16.40) holding
for $J$ and any $x \in \bigcup_n \mathcal{T}^{(n)} \cap J$. To extend this to arbitrary $x$, we introduce an
identity: If $I \subset J$, $I = (a, b)$, $x \in I$, then the strong Markov property gives

$$(16.41)\qquad m(x, J) = m(x, I) + p^{+}(x, I)\, m(b, J) + p^{-}(x, I)\, m(a, J).$$

Take $J$ finite and open, $\bar{J} \subset F$, $x \in J$, $y < z < x$ and $y, z \in \bigcup_n \mathcal{T}^{(n)} \cap J$.
Use (16.41) to write

$$(16.42)\qquad m(z, J) = m(z, (y, x)) + p^{+}(z, (y, x))\, m(x, J) + p^{-}(z, (y, x))\, m(y, J).$$

Take $z \uparrow x$. By (16.40), $m(z, (y, x)) \to 0$, so

$$\lim_{z \uparrow x} m(z, J) = m(x, J).$$

Since the integral $\int G_J(x, y)\, m(dy)$ is continuous in $x$, then (16.40) holds for
all $x \in J$. But now the validity of (16.40) and the fact that the set of functions
$\{G_J(x, y)\}$, $a, b, x \in \operatorname{int}(F)$, are separating on $\operatorname{int}(F)$ imply that $m(dy)$ is
unique, and the theorem follows.


One question left open is the assignment of mass to closed endpoints 
of F. This we defer to Section 7. 
Problems

7. If $X(t)$ is not on its natural scale, show that $m(dx)$ can still be defined by

$$m(x, J) = \int G_J(x, y)\, m(dy)$$

by using the definition

$$G_J(x, y) = \begin{cases} \dfrac{(u(x) - u(a))(u(b) - u(y))}{u(b) - u(a)}, & x, y \in J,\ x \le y, \\[2mm] \dfrac{(u(y) - u(a))(u(b) - u(x))}{u(b) - u(a)}, & x, y \in J,\ x \ge y, \\[2mm] 0, & \text{otherwise.} \end{cases}$$


8. For $X(t)$ Brownian motion with zero drift, show that for $J = (a, b)$,

$$E_x t^*(J) = \frac{(x - a)(b - x)}{\sigma^2}.$$

Use this to prove

$$m(dy) = \frac{1}{\sigma^2}\, dy.$$

[Another way to see the form of $m(dy)$ for Brownian motion is this: $m_n(dy)$
assigns equal masses to points equally spaced. The only possible limit
measure of measures of this type is easily seen to be a constant multiple
$c\, dy$ of Lebesgue measure. If $c_0$ is the constant for normalized Brownian
motion, a scale transformation of $X(t)$ shows that $c = c_0/\sigma^2$.]
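
As a hedged numerical companion to Problem 8 (not a solution of it), the sketch below integrates $G_J(x, y)$ from (16.35) against $m(dy) = dy/\sigma^2$ by the midpoint rule and compares the result with $(x - a)(b - x)/\sigma^2$. The function names, the interval, and the value of $\sigma$ are assumptions chosen for illustration.

```python
def green(x, y, a, b):
    """G_J(x, y) of (16.35) for J = (a, b); zero outside J."""
    if not (a < x < b and a < y < b):
        return 0.0
    lo, hi = min(x, y), max(x, y)
    return 2.0 * (lo - a) * (b - hi) / (b - a)

def mean_exit_time(x, a, b, sigma=1.0, steps=20000):
    """Midpoint-rule value of the integral of G_J(x, y) m(dy), with m(dy) = dy / sigma^2."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        y = a + (i + 0.5) * h
        total += green(x, y, a, b) * h
    return total / sigma**2

# Assumed example: J = (0, 1), start at x = 0.3, sigma = 2.
numeric = mean_exit_time(0.3, 0.0, 1.0, sigma=2.0)
closed_form = (0.3 - 0.0) * (1.0 - 0.3) / 2.0**2
```

The two numbers agree up to quadrature error, which is what Theorem 16.36 together with $m(dy) = dy/\sigma^2$ predicts.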


9. For $x \in \operatorname{int}(F)$ and $J_n$ open neighborhoods of $x$ such that $J_n \downarrow \{x\}$, show
that $t^*(J_n) \xrightarrow{\text{a.s.}} 0$. Deduce that $t^*(\{x\}) = 0$ a.s. $P_x$.


10. For $f(x)$ a bounded continuous function on $F$, $J$ any finite open interval
such that $\bar{J} \subset F$, prove that

$$E_x\Big[\int_0^{t^*(J)} f(X(t))\, dt\Big] = \int_J G_J(x, y)\, f(y)\, m(dy).$$

[Use the approximation by random walks on partitions.]


7. BOUNDARIES 


This section sets up classifications that summarize the behavior at the 
boundary points of F. If F is infinite in any direction, say to the right, call 
$+\infty$ the boundary point on the right; similarly for $-\infty$.

For a process on its natural scale the speed measure m(dx) defined on 
int(F) will to a large extent determine the behavior of the process at the 
boundaries of F. For example, we would expect that knowing the speed 
measure in the vicinity of a boundary point $b$ of $F$ would tell us whether the
process would ever hit $b$, hence whether $b$ was an open or closed endpoint
of $F$. Here, by open endpoint, we mean $b \notin F$, by closed, $b \in F$. In fact,
we have


Proposition 16.43. If $b$ is a finite endpoint of $F$, then $b \notin F$ if and only if for
$J \subset F$ any nonempty open neighborhood with $b$ as endpoint,

$$\int_J |b - y|\, m(dy) = \infty.$$

Remark. A closed endpoint of F is sometimes called accessible, for obvious 
reasons. 


Proof. Assume $b$ is a right endpoint. If $b \in F$, then there exists a $t$ such that
for $c \in \operatorname{int}(F)$, $P_c(t^*_b > t) = \alpha < 1$. Let $J = (c, b)$. Then $P_x(t^*(J) > t) \le \alpha$,
all $x \in J$. Use Lemma 16.25 to get $m(x, J) < \infty$, all $x \in J$. For $b \in F$, $\bar{J} \subset F$;
hence 16.36 holds:

$$m(x, J) = \int G_J(x, y)\, m(dy) \ge \frac{2(x - c)}{b - c} \int_{(x, b)} |b - y|\, m(dy).$$

Therefore the integral over $(x, b)$ is finite. Conversely, if the integral is finite
for one $c \in \operatorname{int}(F)$, then as $z \uparrow b$, if $b \notin F$, and $x > c$,

$$E_x t^*_c = \lim_{z \uparrow b} m(x, (c, z)) = \int_J G_J(x, y)\, m(dy), \qquad J = (c, b).$$

The left-hand side of this is nondecreasing as $x \uparrow b$. But the integral equals

$$\frac{2(b - x)}{b - c} \int_{(c, x]} (y - c)\, m(dy) + \frac{2(x - c)}{b - c} \int_{(x, b)} |b - y|\, m(dy),$$

which goes to zero as $x \uparrow b$.
For open boundary points $b$, there is a two-way classification.

Definition 16.44. Let $x \in \operatorname{int}(F)$, $y \in \operatorname{int}(F)$, $y \to b$ monotonically. Call $b$

natural if for all $t > 0$, $\lim_{y \to b} P_y(t^*_x < t) = 0$;

entrance if there is a $t > 0$ such that $\lim_{y \to b} P_y(t^*_x < t) > 0$.


A natural boundary behaves like the points at $\infty$ for Brownian motion.
It takes the particle a long time to get close to such a boundary point and 
then a long time to get away from it. An entrance boundary has the odd 
property that it takes a long time for the particle to get out to it but not to 
get away from it. 
Proposition 16.45. Let $b$ be an open boundary point.

a) If $b$ is finite, it must be natural.
b) If $b$ is infinite, it is natural if and only if for $J \subset F$ any open interval with $b$
as an endpoint,

$$\int_J |y|\, m(dy) = \infty.$$


Proof. Take $b$ finite and right-hand, say. If $b$ is an entrance point, then
for $J = (c, b)$, use Lemma 16.25 to get $m(x, J) \le M < \infty$ for all $x \in J$.
This implies, if we take $J_n \uparrow J$ and use the monotone convergence theorem,
that

$$\int_J (b - y)\, m(dy) < \infty.$$

This is impossible, by 16.43, since $b$ is an open endpoint. For $b = \infty$, check
that

$$\lim_{c \to \infty} m(x, (a, c)) = 2\int_a^x (y - a)\, dm + 2(x - a)\int_x^{\infty} dm.$$

If $b = \infty$ is entrance, then there is an $a$ such that for $J = (a, \infty)$

$$P_x(t^*(J) > t) = P_x(t^*_a > t) < 1 - \alpha < 1$$

for all $x \in J$. Use Lemma 16.25 to get

$$2\int_a^x (y - a)\, dm + 2(x - a)\int_x^{\infty} dm \le M < \infty, \qquad x \in J.$$

Taking $x \to \infty$ proves that $\int_J |y|\, dm < \infty$. Conversely, if the integral is
finite for $J = (a, \infty)$, then there exists an $M < \infty$ such that for all $a < x <
c < \infty$, $m(x, (a, c)) \le M$. Assume $b = \infty$ is not an entrance boundary.
For any $0 < \epsilon < \tfrac{1}{4}$, take $c, x$ such that $a < x < c$ and

$$P_x(t^*_a < 2M) < \epsilon, \qquad P_x(t^*_c < 2M) < \epsilon.$$

Then

$$m(x, (a, c)) \ge 2M\, P_x(t^*((a, c)) > 2M) \ge 2M(1 - 2\epsilon) > M,$$

and this completes the proof.


There is also a two-way classification of closed boundary points b. This 
is connected with the assignment of speed measure m({b}) on these points. 
Say $b$ is a closed left-hand endpoint. We define a function $G_J(x, y)$ for all
intervals of the form $J = [b, c)$, and $x, y \in J$ as follows: Let $\hat{J}$ be the
reflection of $J$ around $b$, that is, $\hat{J} = (b - (c - b), b]$, and $\hat{y}$ the reflection of
$y$ around $b$, $\hat{y} = b - (y - b)$. Then define

$$(16.46)\qquad G_J(x, y) = G_{J \cup \hat{J}}(x, y) + G_{J \cup \hat{J}}(x, \hat{y}),$$

where $G_{J \cup \hat{J}}$ is defined by (16.35). This definition leads to an extension of
(16.40).

Theorem 16.47. It is possible to define $m(\{b\})$ so that for all finite intervals
$J = [b, c)$, $\bar{J} \subset F$,

$$m(x, J) = \int_J G_J(x, y)\, m(dy).$$

Remark. Note that if $\int_{J - \{b\}} m(dy) = \infty$, then no matter how $m(\{b\})$ is assigned,
$m(x, J) = \infty$. This leads to


Definition 16.48. Let $b$ be a closed boundary point, $J \subset F$ any finite open
interval with $b$ as one endpoint such that $J$ is smaller than $F$. Call $b$

regular if $m(J) < \infty$,

exit if $m(J) = \infty$.
Proof of 16.47. The meaning of the definition of $G_J$ will become more
intuitive in the course of this proof. Partition any finite interval $[b, c)$ by
equally spaced points $b = x_0 < x_1 < \cdots < x_n = c$, a distance $\delta$ apart.
Define $J_k = (x_{k-1}, x_{k+1})$, $k = 1, \ldots, n - 1$, $J_0 = [x_0, x_1)$. By exactly the
same reasoning as in deriving 16.38,

$$m(x_j, J) = \sum_{k=0}^{n-1} n^{(r)}(x_j; x_k)\, m(x_k, J_k),$$

where $n^{(r)}(x_j; x_k)$ is the expected number of visits to $x_k$ of a random walk
$Z_0, Z_1, \ldots$ on $x_0, \ldots, x_n$ starting from $x_j$, with reflection at $x_0$ and
absorption at $x_n$. Construct a new state space $x_{-n}, \ldots, x_{-1}, x_0, \ldots, x_n$, where
$x_{-k} = \hat{x}_k$, and consider a symmetric random walk $\hat{Z}_0, \hat{Z}_1, \ldots$ on this space
with absorption at $x_{-n}$, $x_n$. Use the argument that $Z_k = |\hat{Z}_k - x_0| + x_0$
is the reflected random walk we want, to see that for $x_k \ne x_0$,

$$n^{(r)}(x_j; x_k) = \hat{n}(x_j; x_k) + \hat{n}(x_j; x_{-k}),$$

where $\hat{n}(x_j; x_k)$ refers to the $\hat{Z}$ random walk, and

$$n^{(r)}(x_j; x_0) = \hat{n}(x_j; x_0).$$

Hence, for $G_J(x, y)$ as defined in (16.46),

$$n^{(r)}(x_j; x_k) = \begin{cases} \dfrac{1}{\delta}\, G_J(x_j, x_k), & k \ne 0, \\[2mm] \dfrac{1}{2\delta}\, G_J(x_j, x_0), & k = 0. \end{cases}$$

For $m(x_k, J_k)$, $k \ge 1$, use the expression $\int_{J_k} G_{J_k}(x_k, y)\, m(dy)$. Then

$$m(x_j, J) = \frac{1}{2\delta}\, G_J(x_j, x_0)\, m(x_0, J_0) + \int_{(b, c)} \Big(\sum_{k=1}^{n-1} \frac{1}{\delta}\, G_J(x_j, x_k)\, G_{J_k}(x_k, y)\Big)\, m(dy).$$

The integrand converges uniformly to $G_J(x_j, y)$ as $\delta \to 0$. Hence for $b$ not
an exit boundary, the second term of this converges to

$$\int_{J - \{b\}} G_J(x_j, y)\, m(dy)$$

as we pass through successive refinements such that $\delta \to 0$. Therefore,
$m(x_0, J_0)/2\delta$ must converge to a limit, finite or infinite. Define $m(\{b\})$ to be
this limit. Then for $x_j$ any point in the successive partitions,

$$m(x_j, J) = \int_J G_J(x_j, y)\, m(dy).$$

For $m(\{b\}) < \infty$, extend this to all $x \in J$ by an argument similar to that in
the proof of 16.36. For $m(\{b\}) = \infty$, show in the same way that $m(x, J) = \infty$
for all $x \in J$.


Note that the behavior of the process is specified at all boundary points
except regular boundary points by $m(dy)$ on $\mathcal{B}_1(\operatorname{int} F)$. Summarizing
graphically, for $b$ a right-hand endpoint,

Classification                        Type
int(F) ↛ b,  b ↛ int(F)               natural boundary
int(F) ↛ b,  b → int(F)               entrance boundary
int(F) → b,  b ↛ int(F)               exit boundary
int(F) ⇄ b                            regular boundary


The last statement (⇄ for regular boundary points) needs some explanation.
Consider reflecting and absorbing Brownian motion on $[b, \infty)$ as described
in Section 3. Both of these processes have the same speed measure $m(dy) =
dy$ on $(b, \infty)$, and $b$ is a regular boundary point for both of them. They
differ in the assignment of $m(\{b\})$. For the absorbing process, obviously
$m(\{b\}) = \infty$; for the reflecting process $m(\{b\}) < \infty$. Hence, in terms of
$m$ on $\operatorname{int}(F)$ it is possible to go $\operatorname{int}(F) \to b$ and $b \to \operatorname{int}(F)$. Of course, the
latter is ruled out if $m(\{b\}) = \infty$, so ⇄ should be understood above to mean
"only in terms of $m$ on $\operatorname{int}(F)$."


Definition 16.49. A regular boundary point $b$ is called
absorbing if $m(\{b\}) = \infty$,
slowly reflecting if $0 < m(\{b\}) < \infty$,
instantaneously reflecting if $m(\{b\}) = 0$.


See Problem 11 for some interesting properties of a slowly reflecting boundary. 
Problems


11. Show, by using the random walk approximation, that

$$(c - b)\, m(\{b\}) = E_b\big(l\{t;\ X(t) = b,\ t < t^*_c\}\big).$$

Conclude that if $m(\{b\}) > 0$, then $X(t)$ spends a positive length of time at
point $b \Rightarrow p_t(\{b\} \mid b) > 0$ for some $t$. Show also that for almost all sample
paths, $\{t;\ X(t) = b\}$ contains no intervals with positive length and no
isolated points.


12. For $b$ an entrance boundary, $J = (b, c)$, $c \in \operatorname{int}(F)$, show that

$$m(b, J) = \int_J (c - y)\, m(dy),$$

where $m(b, J) = \lim m(x, J)$, $x \to b$.

13. For $b$ a regular boundary point, $J = [b, c)$, $\bar{J} \subset F$, show that

$$m(b, J) = \int_J |y - c|\, m(dy).$$


14. For $b$ an exit boundary, and any $J = (a, b)$, $x \in J$, show that $m(x, J) = \infty$
(see the proof of 16.47). Use 16.25 to conclude that $P_b(t^*_x < \infty) = 0$. Hence,
deduce that $p_t(\{b\} \mid b) = 1$, $t > 0$.


8. CONSTRUCTION OF FELLER PROCESSES 


Assume that a Feller process is on its natural scale. Because $X(t)$ has
the same exit probabilities as Brownian motion $X_0(t)$, we should be able
to construct a process with the same distribution as $X(t)$ by expanding or
contracting the time scale of the Brownian motion, depending on the current
position of the particle. Suppose $m(dx)$ is absolutely continuous with respect
to Lebesgue measure,

$$m(dx) = V(x)\, dx,$$

where $V(x)$ is continuous on $F$. For $J$ small,

$$m(x, J) \approx V(x)\, m_0(x, J),$$

where $m_0(x, J)$ is the expected time for $X_0(t)$ to exit from $J$. So if it takes
Brownian motion time $\Delta t$ to get out of $J$, it takes $X(t)$ about $V(x)\, \Delta t$ to get
out of $J$. Look at the process $X_0(T(t))$, where $T(t) = T(t, \omega)$ is for every $\omega$
an increasing function of time. If we want this process to look like $X(t)$, then
when $t$ changes by $V(x)\, \Delta t$, we must see that $T$ changes by the amount $\Delta t$.
We get the differential equation

$$\frac{dT}{dt} = \frac{1}{V(x)}.$$

We are at the point $x = X_0(T(t))$, so

$$\frac{dT}{dt} = \frac{1}{V(X_0(T))} \qquad \text{or} \qquad V(X_0(T))\, dT = dt.$$

Integrating, we get

$$\int_0^{T(t)} V(X_0(\xi))\, d\xi = t.$$

Hence,


Definition 16.50. Take $X_0(t)$ to be Brownian motion on $F$, instantaneously
reflecting at all finite endpoints. Denote

$$I(r) = \int_0^r V(X_0(\xi))\, d\xi.$$

Define $T(t)$ to be the solution of $I(r) = t$, that is,

$$t = \int_0^{T(t)} V(X_0(\xi))\, d\xi.$$


Remark. Because $m(J) > 0$ for all open neighborhoods $J$, $\{x;\ V(x) = 0\}$ can
contain no open intervals. But almost no sample function of $X_0(t)$ can have
any interval of constancy. Hence $I(r)$ is a strictly increasing continuous
function of $r$, and so is $T(t)$. Further, note that for every $t$, $T(t)$ is a Markov
time for $\{X_0(t)\}$.
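
The time change of Definition 16.50 can be illustrated on a discretized path. This is only a sketch under stated assumptions: the state space is taken to be the whole line (so no reflection is needed), the Brownian path is replaced by a Gaussian random walk with step `dt`, and the density `speed` standing in for $V(x)$ is hypothetical. The clock $I(r)$ becomes a cumulative sum, and $T(t)$ is read off as the first grid time at which the clock reaches $t$.

```python
import numpy as np

def time_change(path, dt, speed_density, t_grid):
    """Approximate T(t), the solution of I(T) = t, where I(r) = int_0^r V(X_0(xi)) d xi.

    path[k] approximates X_0(k * dt). Returns (T(t), X_0(T(t))) for each t in t_grid.
    """
    v = speed_density(path[:-1])                          # V along the path, left endpoints
    clock = np.concatenate(([0.0], np.cumsum(v * dt)))    # I(k * dt), k = 0..N
    idx = np.searchsorted(clock, t_grid, side="left")     # first k with I(k * dt) >= t
    idx = np.clip(idx, 0, len(path) - 1)
    return idx * dt, path[idx]

rng = np.random.default_rng(0)
dt, n_steps = 1e-3, 5000
increments = rng.normal(0.0, np.sqrt(dt), size=n_steps)
x0_path = np.concatenate(([0.0], np.cumsum(increments)))  # discretized Brownian path from 0

speed = lambda x: 1.0 + x**2          # hypothetical V(x) > 0, continuous
t_grid = np.linspace(0.0, 1.0, 11)
T, x_path = time_change(x0_path, dt, speed, t_grid)       # x_path approximates X(t) = X_0(T(t))
```

Since this $V \ge 1$, the clock $I$ runs at least as fast as real time, so $T(t) \le t$ up to discretization: the constructed process is a slowed-down Brownian motion, more strongly slowed where $V$ is large.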


Theorem 16.51. $X(t) = X_0(T(t))$ is a Feller process on natural scale with
speed measure

$$m(dx) = V(x)\, dx.$$


Proof. For any Markov time $t^*$, $X_0(t^*)$ is measurable $\mathcal{F}(X_0(s), s \le t^*)$.
Further, it is easy to show for any two Markov times $t^*_1 \le t^*_2$, that

$$\mathcal{F}(X_0(s), s \le t^*_1) \subset \mathcal{F}(X_0(s), s \le t^*_2).$$

This implies that $X(\tau) = X_0(T(\tau))$ is measurable $\mathcal{F}(X_0(s), s \le T(t))$, for any
$\tau \le t$. So

$$\mathcal{F}(X(s), s \le t) \subset \mathcal{F}(X_0(s), s \le T(t)).$$

Hence $P_x(X(t + \tau) \in B \mid X(s), s \le t)$ equals the expectation of

$$P_x(X(t + \tau) \in B \mid X_0(s), s \le T(t)),$$

given $\mathcal{F}(X(s), s \le t)$. To evaluate $T(t + \tau)$, write

$$t + \tau = \int_0^{T(t + \tau)} V(X_0(\xi))\, d\xi = \int_0^{\Delta} V(X_0(\xi + T(t)))\, d\xi + t,$$

where $\Delta = T(t + \tau) - T(t)$. Thus $\Delta$ is the solution of $I_1(r) = \tau$, where

$$I_1(r) = \int_0^r V(X_0(\xi + T(t)))\, d\xi.$$

Now $X(t + \tau) = X_0(T(t + \tau)) = X_0(\Delta + T(t))$. Because $I_1(r)$ is the same
function on the process $X_0(\cdot + T(t))$ as $I(r)$ is on $X_0(\cdot)$, the strong Markov
property applies:

$$P_x(X_0(\Delta + T(t)) \in B \mid X_0(s), s \le T(t)) = P_{X_0(T(t))}(X_0(T(\tau)) \in B) = P_{X(t)}(X(\tau) \in B).$$


The proof that $X(t)$ is strong Markov is only sketched. Actually, what we
will prove is that the transition probabilities for the process are stable.
Let $\varphi(x)$ be bounded and continuous on $F$. Denote by $t^*_y$ the first passage
time of $X_0(t)$ to $y$. As $y \to x$, $t^*_y \to 0$ a.s. $P_x$. Hence, by path continuity,
as $y \to x$,

$$E_x \varphi(X(t + I(t^*_y))) \to E_x \varphi(X(t)).$$

$T(t + I(t^*_y))$ is defined as the solution of

$$t + I(t^*_y) = \int_0^{r} V(X_0(\xi))\, d\xi.$$

Thus $\Delta = T(t + I(t^*_y)) - t^*_y$ is the solution of

$$t = \int_0^{\Delta} V(X_0(\xi + t^*_y))\, d\xi.$$

Use the strong Markov property of $X_0(t)$ to compute

$$E_x \varphi(X(t + I(t^*_y))) = E_x \varphi(X_0(\Delta + t^*_y)) = E_y \varphi(X(t)).$$

Thus $E_x \varphi(X(t))$ is continuous.
Denote by $t^*(J)$ the first exit time of $X(t)$ from $J$, $t^*_0(J)$ the exit time for
$X_0(t)$. The relationship is

$$(16.52)\qquad t^*_0(J) = T(t^*(J)) \qquad \text{or} \qquad t^*(J) = \int_0^{t^*_0(J)} V(X_0(\xi))\, d\xi.$$

Taking expectations, we get

$$m(x, J) = E_x \int_0^{t^*_0(J)} V(X_0(\xi))\, d\xi.$$

By Problem 10, the latter integral is

$$\int G_J(x, y)\, V(y)\, m_0(dy) = \int G_J(x, y)\, V(y)\, dy.$$

Thus, the process has the asserted speed measure.
To show $X(t)$ is on its natural scale, take $J = (a, b)$,

$$P_x(X(t^*(J)) = b) = P_x(X_0(T(t^*(J))) = b) = P_x(X_0(t^*_0(J)) = b).$$

If $m(dx)$ is not absolutely continuous, the situation is much more difficult.
The key difficulty lies in the definition of $I(r)$. So let us attempt to transform
the given definition of $I(r)$ above into an expression not involving $V(x)$:
Let $L(t, J)$ be the time that $X_0(\xi)$ spends in $J$ up to time $t$,

$$L(t, J) = l\{\xi;\ X_0(\xi) \in J,\ \xi \le t\}.$$

Then $I(r)$ can be written approximately as

$$I(r) \approx \sum_k V(x_k)\, L(r, J_k) \approx \sum_k \frac{L(r, J_k)}{|J_k|}\, m(J_k).$$

Suppose that

$$\lim_{J \downarrow \{y\}} \frac{L(t, J)}{|J|} = L^*(t, y)$$

exists for all $y$, $t$. Then assuming the limit exists in some nice way in $y$,
we get the important alternative expression for $I(r)$,

$$(16.53)\qquad I(r) = \int L^*(r, y)\, m(dy).$$

That such a function $L^*(t, x)$ exists having some essential properties is the
consequence of a remarkable theorem due to Trotter [136].


Theorem 16.54. Let $X(t)$ be unrestricted Brownian motion on $R^{(1)}$. Almost
every sample path has the property that there is a function $L^*(t, y)$ continuous on
$\{(t, y);\ t \ge 0,\ y \in R^{(1)}\}$ such that for all $B \in \mathcal{B}_1$,

$$l\{\xi;\ X(\xi) \in B,\ \xi \le t\} = \int_B L^*(t, y)\, dy.$$


Remarks. $L^*(t, y)$ is called local time for the process. It has to be
nondecreasing in $t$ for $y$ fixed. Because of its continuity, the limiting procedure
leading up to (16.53) is justified. The proof of this theorem is too long and
technical to be given here. See Ito and McKean [76, pp. 63 ff.] or Trotter
[136].

Assume the validity of 16.54. Then it is easy to see that local time
$L^*_0(t, y)$ exists for the Brownian motion with reflecting boundaries. For
example, if $x = 0$ is a reflecting boundary, then for $B \in \mathcal{B}_1((0, \infty))$,

$$l\{\xi;\ X_0(\xi) \in B,\ \xi \le t\} = \int_B L^*_0(t, y)\, dy,$$

where $L^*_0(t, y) = L^*(t, y) + L^*(t, -y)$.
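
The heuristic $L(t, J)/|J| \to L^*(t, y)$ behind local time can be mimicked on a simulated path. The sketch below is an illustration only: it replaces Brownian motion by a Gaussian random walk with step `dt`, uses a fixed window half-width `eps` rather than a limit, and the function name `occupation_density` is an assumption of this example.

```python
import numpy as np

def occupation_density(path, dt, y, eps):
    """Running estimate of L*(t, y): time spent in (y - eps, y + eps) up to t, over 2 * eps.

    path[k] approximates X(k * dt); entry k of the result is the estimate at time k * dt.
    """
    inside = np.abs(path[:-1] - y) < eps
    return np.concatenate(([0.0], np.cumsum(inside * dt))) / (2.0 * eps)

rng = np.random.default_rng(1)
dt, n_steps, eps = 1e-3, 2000, 0.05
path = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), size=n_steps))))

local_time_at_zero = occupation_density(path, dt, 0.0, eps)   # nondecreasing in t
```

Summing the final estimates over a grid of disjoint windows of width `2 * eps`, each multiplied by its width, recovers the total elapsed time, which is the discrete counterpart of the occupation formula in Theorem 16.54 with $B = R^{(1)}$.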


Definition 16.55. Take $X_0(t)$ to be Brownian motion on $F$, instantaneously
reflecting at all closed endpoints, with local time $L^*(t, y)$. Let $m(dx)$ be any
measure on $\mathcal{B}_1(F)$ such that $0 < m(J) < \infty$ for all finite open intervals $J$ with
$\bar{J} \subset \operatorname{int}(F)$. Denote

$$I(r) = \int L^*(r, y)\, m(dy),$$

and let $T(t)$ be the solution of $I(r) = t$.

Since $m(F)$ is not necessarily finite, then along some path there may be an $r$
such that $I(r) = \infty$; hence $I(s) = \infty$, $s \ge r$. But if $I(r) < \infty$, then $I(s)$ is
continuous on $0 \le s \le r$. Furthermore, it is strictly increasing on this
range. Otherwise there would be an $s$, $t$, $0 \le t < s \le r$ such that $L^*(t, y) =
L^*(s, y)$ for all $y \in F$. Integrate this latter equality over $F$ with respect to $dy$
to get the contradiction $s = t$. Thus, $T(t)$ will always be well defined except
along a path such that there is an $r$ with $I(r) < \infty$, $I(r+) = \infty$ (by the
monotone convergence theorem $I(r)$ is always left-continuous). If this
occurs, define $T(t) = r$ for all $t \ge I(r)$. With this added convention, $T(t)$ is
continuous and strictly increasing unless it becomes identically constant.


Theorem 16.56. $X_0(T(t))$ is a Feller process on natural scale with speed measure
$m(dy)$.


Proof. That $X(t) = X_0(T(t))$ is on natural scale is again obvious. To
compute the speed measure of $X(t)$, use (16.52), $t^*_0(J) = T(t^*(J))$ or
$I(t^*_0(J)) = t^*(J)$. Hence

$$m(x, J) = \int E_x L^*(t^*_0(J), y)\, m(dy).$$

The integrand does not depend on $m(dy)$. By the definition and continuity
properties of local time the integral

$$\int_I E_x L^*(t^*_0(J), y)\, dy, \qquad I \subset J,$$

is the expected amount of time the Brownian particle starting from $x$ spends
in the interval $I$ before it exits from $J$. We use Problem 10 to deduce that
this expected time is also given by

$$\int_I G_J(x, y)\, dy.$$

The verification that $E_x L^*(t^*_0(J), y)$ is continuous in $y$ leads to its identification
with $G_J(x, y)$, and proves the assertion regarding the speed measure of the
process. Actually, the identity of Problem 10 was asserted to hold only for
the interior of $F$, but the same proof goes through for $J$ including a closed
endpoint of $F$ when the extended definition of $G_J(x, y)$ is used.

The proof in 16.51 that $X(t)$ is strong Markov is seen to be based on two
facts:

a) $T(t)$ is a Markov time for every $t \ge 0$;
b) $T(t + \tau) = \Delta + T(t)$, where $\Delta$ is the solution of $I_1(r) = \tau$, and $I_1(r)$ is
the integral $I(r)$ based on the process $X_0(\cdot + T(t))$.

It is easy to show that (a) is true in the present situation. To verify (b),
observe that $L^*(t, y)$ is a function of $y$ and the sample path for $0 \le \xi \le t$.
Because

$$l\{\xi;\ X_0(\xi) \in B,\ \xi \le r + s\} = l\{\xi;\ X_0(\xi) \in B,\ \xi \le r\} + l\{\xi;\ X_0(\xi + r) \in B,\ 0 \le \xi \le s\},$$

it follows that

$$L^*(r + s, y) = L^*(r, y) + L^{**}(s, y),$$

where $L^{**}(s, y)$ is the function $L^*(s, y)$ evaluated along the sample path
$X_0(r + \xi)$, $0 \le \xi \le s$. Therefore

$$I(r + s) = I(r) + I_1(s),$$

where $I_1(s)$ is $I(s)$ evaluated along the path $X_0(r + \xi)$, $0 \le \xi \le s$. The rest
goes as before.


Many details in the above proof are left to the reader to fill in. An 
important one is the examination of the case in which I(r) = œ for finite r. 
This corresponds to the behavior of X(t) at finite boundary points. In 
particular, if at a closed boundary point $b$ the condition of accessibility
(16.43) is violated, then it can be shown that the constructed process never 
reaches b. Evidently, for such measures m(dy), the point b should be deleted 
from F. With this convention we leave to the reader the proof that the 
constructed process X(t) is regular on F. 


Problem 15. Since $L^*(t, 0)$ is nondecreasing, there is an associated measure
$L^*(dt, 0)$. Show that $L^*(dt, 0)$ is concentrated on the zeros of $X_0(t)$. That is,
prove

$$\int_{\{\xi;\ X_0(\xi) = 0,\ \xi \le t\}} L^*(d\xi, 0) = L^*(t, 0).$$

(This problem illustrates the fact that $L^*(t, 0)$ is a measure of the time the
particle spends at zero.)


9. THE CHARACTERISTIC OPERATOR 


The results of the last section show that corresponding to every speed 
measure m(dx) there is at least one Feller process on natural scale. In fact, 
there is only one. Roughly, the reasoning is that by breaking F down into 
smaller and smaller intervals, the distribution of time needed to get from 
one point to another becomes determined by the expected times to leave 
the small subintervals, hence by m(dx). But to make this argument firm, 
an excursion is needed. This argument depends on what happens over small 
space intervals. The operator S which determines behavior over small time 
intervals can be computed by allowing a fixed time $t$ to elapse, averaging
over $X(t)$, and then taking $t \to 0$. Another approach is to fix the terminal
space positions and average over the time it takes the particle to get to these
terminal space positions. Take $x \in J$, $J$ open, and define

$$(16.57)\qquad (U_J f)(x) = \frac{E_x f(X(t^*(J))) - f(x)}{E_x t^*(J)}.$$

Let $J \downarrow \{x\}$. If $\lim (U_J f)(x)$ exists, then see whether, under some reasonable
conditions, the limit will equal $(Sf)(x)$.


Definition 16.58. Let $x \in F$. For any neighborhood $J$ of $x$ open relative to $F$,
suppose a function $\varphi(J)$ is defined. Say that

$$\lim_{J \downarrow \{x\}} \varphi(J) = \alpha$$

if for every system $J_n$ of such neighborhoods, $J_n \downarrow \{x\}$, $\lim_n \varphi(J_n) = \alpha$.


Theorem 16.59. Let $f \in \mathcal{D}(S, C)$, where $C$ consists of all bounded continuous
functions on $F$. Then

$$\lim_{J \downarrow \{x\}} (U_J f)(x) = (Sf)(x)$$

for all $x \in F$.
Proof. The proof is based on an identity due to Dynkin. 


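Lemma 16.60. For any Markov time $t^*$ such that $E_x t^* < \infty$, for $f \in \mathcal{D}(S, C)$
and $g = Sf$,

$$E_x f(X(t^*)) - f(x) = E_x \int_0^{t^*} g(X(t))\, dt.$$


Proof. For any bounded measurable $h(x)$ consider $f(x) = (R_\lambda h)(x)$ and write

$$(16.61)\qquad E_x e^{-\lambda t^*} f(X(t^*)) = E_x \int_0^{\infty} e^{-\lambda(t + t^*)} h(X(t + t^*))\, dt = E_x \int_{t^*}^{\infty} e^{-\lambda t} h(X(t))\, dt = f(x) - E_x \int_0^{t^*} e^{-\lambda t} h(X(t))\, dt.$$

For $f$ in $\mathcal{D}(S, C)$, take $h = (\lambda - S)f$. Then by 15.51(2), $f = R_\lambda h$ and (16.61)
becomes

$$E_x e^{-\lambda t^*} f(X(t^*)) - f(x) = E_x \int_0^{t^*} e^{-\lambda t} g(X(t))\, dt - \lambda E_x \int_0^{t^*} e^{-\lambda t} f(X(t))\, dt,$$

where the last integral exists for all $\lambda \ge 0$ by the boundedness of $f(x)$ and
$E_x t^* < \infty$. Taking $\lambda \to 0$ now proves the lemma.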
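
A quick consistency check of Dynkin's identity, offered as a sketch rather than as part of the proof: take standard Brownian motion, $t^* = t^*(J)$ with $J = (a, b)$, and $f(x) = x^2$. Two assumptions are made here that go beyond the lemma as stated. First, $f$ is unbounded, so it is not in $\mathcal{D}(S, C)$; it is used only formally, which is harmless for this check because the stopped path stays in $[a, b]$. Second, $Sf = \tfrac{1}{2} f'' = 1$ and $E_x t^*(J) = (x - a)(b - x)$ are taken as known for standard Brownian motion (compare Problem 8 with $\sigma = 1$). The function name `dynkin_sides` is illustrative.

```python
from fractions import Fraction

def dynkin_sides(x, a, b):
    """Both sides of Lemma 16.60 for standard Brownian motion, f(x) = x^2, t* = t*((a, b))."""
    p_plus = (x - a) / (b - a)        # P_x(exit at b) on natural scale
    p_minus = (b - x) / (b - a)       # P_x(exit at a)
    lhs = p_plus * b**2 + p_minus * a**2 - x**2   # E_x f(X(t*)) - f(x)
    rhs = (x - a) * (b - x)           # E_x int_0^{t*} g dt = E_x t*, since g = 1
    return lhs, rhs

lhs, rhs = dynkin_sides(Fraction(1, 3), Fraction(0), Fraction(2))
```

Exact rational arithmetic is used so that the two sides can be compared for equality rather than up to rounding.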


To finish 16.59, if $x$ is an absorbing or exit endpoint, then both
expressions are zero at $x$. For any other point $x$, there are neighborhoods $J$
of $x$, open relative to $F$, such that $E_x t^*(J) < \infty$. Now $g$ is continuous at $x$,
so take $J$ sufficiently small so that $|g(X(t)) - g(x)| \le \epsilon$, $t \le t^*(J)$. Then by
the lemma,

$$E_x f(X(t^*(J))) - f(x) = (g(x) + \eta)\, E_x t^*(J), \qquad |\eta| \le \epsilon.$$

Definition 16.62. Define $(Uf)(x)$ for any measurable function $f$ as

$$\lim_{J \downarrow \{x\}} (U_J f)(x)$$

wherever this limit exists. Denote $f \in \mathcal{D}(U, I)$ if the limit exists for all $x \in I$.

Corollary 16.63. If $f \in \mathcal{D}(S, C)$, then $f \in \mathcal{D}(U, F)$.


There are a number of good reasons why U is usually easier to work with 
than S. For our purposes an important difference is that U is expressible 
in terms of the scale and speed measure. For any Feller process we can show 
that (Uf)(x) is very nearly a second-order differential operator. There is a 
simple expression for this when the process is on natural scale. 


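Theorem 16.64. Let $X(t)$ be on its natural scale, $f \in \mathcal{D}(S, C)$. Then

$$(Uf)(x) = \frac{1}{2}\, \frac{d}{dm}\, \frac{d}{dx}\, f(x)$$

in the following sense for $x \in \operatorname{int}(F)$,

i) $f'(x)$ exists except perhaps on the countable set of points $\{x;\ m(\{x\}) > 0\}$.
ii) For $J = (x_1, x_2)$ finite, such that $f'(x_1)$, $f'(x_2)$ exist,

$$f'(x_2) - f'(x_1) = 2\int_J g(x)\, m(dx),$$

where $g(x) = (Sf)(x)$.

Proof. Use 16.60 and Problem 10 to get

$$E_x f(X(t^*(J))) - f(x) = \int_J G_J(x, y)\, g(y)\, m(dy).$$

Take $J = (x - h_1, x + h_2)$ and use the appropriate values for $p^{\pm}(x, J)$ to
get the following equations.

$$\frac{f(x + h_2) - f(x)}{h_2} - \frac{f(x) - f(x - h_1)}{h_1} = \frac{h_1 + h_2}{h_1 h_2} \int_J G_J(x, y)\, g(y)\, m(dy),$$

$$\frac{h_1 + h_2}{h_1 h_2}\, G_J(x, y) = \begin{cases} \dfrac{2(y - x + h_1)}{h_1}, & x - h_1 < y \le x, \\[2mm] \dfrac{2(x + h_2 - y)}{h_2}, & x \le y < x + h_2. \end{cases}$$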
Taking $h_1, h_2 \downarrow 0$ separately we can show that both right- and left-hand
derivatives exist. When $h_1$, $h_2$ both go to zero, the integrand converges to
zero everywhere except at $y = x$. Use bounded convergence to establish (i).
By substituting $x + h_2 = b$, $x - h_1 = a$, form the identity,

$$(x - a) f(b) + (b - x) f(a) - f(x)(b - a) = (b - a) \int_J G_J(x, y)\, g(y)\, m(dy).$$

Substituting in $x + h$ for $x$ and subtracting we get

$$f(b) - f(a) - (b - a)\, \frac{f(x + h) - f(x)}{h} = (b - a) \int_J \frac{G_J(x + h, y) - G_J(x, y)}{h}\, g(y)\, m(dy).$$

The integrand is bounded for all $h$, and

$$\lim_{h \downarrow 0}\, (b - a)\, \frac{G_J(x + h, y) - G_J(x, y)}{h} = \begin{cases} -2(y - a), & y < x, \\ 2(b - y), & y > x. \end{cases}$$

Therefore $g(x)\, m(\{x\}) = 0$ implies that

$$f(b) - f(a) - (b - a) f'(x) = -2\int_a^x (y - a)\, g\, dm + 2\int_x^b (b - y)\, g\, dm.$$

Take $a < x_1 < x_2 < b$, such that

$$g(x_1)\, m(\{x_1\}) = g(x_2)\, m(\{x_2\}) = 0.$$

Use both $x_1$ and $x_2$ in the equation above and subtract:

$$(b - a)(f'(x_2) - f'(x_1)) = 2\int_{x_1}^{x_2} (y - a)\, g\, dm + 2\int_{x_1}^{x_2} (b - y)\, g\, dm = 2(b - a)\int_{x_1}^{x_2} g\, dm. \qquad \text{Q.E.D.}$$


There is an interesting consequence of this theorem in an important 
special case: 
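Corollary 16.65. Let $m(dx) = V(x)\, dx$ on $\mathcal{B}_1(\operatorname{int} F)$ where $V(x)$ is continuous
on $\operatorname{int}(F)$. Then $f \in \mathcal{D}(S, C)$ implies that $f''(x)$ exists and is continuous on
$\operatorname{int}(F)$, and for $x \in \operatorname{int}(F)$,

$$(Uf)(x) = \frac{1}{2V(x)}\, \frac{d^2 f(x)}{dx^2}.$$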
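
The formula of Corollary 16.65 can be compared with the difference quotient (16.57) on a small symmetric interval. The sketch below assumes a process on natural scale, so that from the center of $J = (x - h, x + h)$ each endpoint is hit with probability $1/2$, and computes $E_x t^*(J)$ as $\int G_J(x, y) V(y)\, dy$ by the midpoint rule. The choices $f = \sin$, $V(y) = 1 + y^2$, $x = 0.4$ and $h = 10^{-2}$ are hypothetical, as are the function names.

```python
import math

def green(x, y, a, b):
    """G_J(x, y) of (16.35) for J = (a, b); zero outside J."""
    if not (a < x < b and a < y < b):
        return 0.0
    lo, hi = min(x, y), max(x, y)
    return 2.0 * (lo - a) * (b - hi) / (b - a)

def u_j(f, speed_density, x, h, steps=2000):
    """(U_J f)(x) of (16.57) for J = (x - h, x + h), process on natural scale."""
    a, b = x - h, x + h
    mean_exit_value = 0.5 * (f(a) + f(b))      # exits at a or b with probability 1/2 each
    w = (b - a) / steps
    mean_exit_time = 0.0
    for i in range(steps):
        y = a + (i + 0.5) * w
        mean_exit_time += green(x, y, a, b) * speed_density(y) * w
    return (mean_exit_value - f(x)) / mean_exit_time

f = math.sin
V = lambda y: 1.0 + y * y          # hypothetical continuous speed density
x = 0.4
approx = u_j(f, V, x, h=1e-2)
exact = -math.sin(x) / (2.0 * V(x))   # f''(x) / (2 V(x)) with f'' = -sin
```

Shrinking `h` further brings `approx` closer to `exact` until floating-point cancellation in the numerator starts to dominate.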
Problems

16. If $b$ is a closed left endpoint, $f \in \mathcal{D}(S, C)$, show that

$$b \text{ reflecting} \Rightarrow f'_{+}(b) = m(\{b\})\, (Sf)(b),$$

where $f'_{+}(b)$ is the one-sided derivative

$$\lim_{h \downarrow 0} \frac{f(b + h) - f(b)}{h}, \qquad b + h \in F.$$

Show also that

$$b \text{ absorbing or exit} \Rightarrow (Sf)(b) = 0.$$

17. If $X(t)$ is not on its natural scale, use the definition

$$g(x) = \Big(\frac{df}{du}\Big)(x) \iff f(x) - f(y) = \int_y^x g(z)\, du(z).$$

Show that for $f \in \mathcal{D}(S, C)$, $x \in \operatorname{int}(F)$,

$$(Uf)(x) = \frac{d}{dm}\, \frac{d}{du}\, f(x).$$


10. UNIQUENESS 

U is given by the scale and speed. But the scale and speed can also be 
recovered from U. 

Proposition 16.66. $p^{+}(x, J)$ satisfies

$$(U p^{+})(x) = 0, \qquad x \in J;$$

and $m(x, J)$ satisfies

$$(U m)(x) = -1, \qquad x \in J,$$

for $J$ open and finite, $\bar{J} \subset F$.

Proof. Let $I \subset J$, $I = (x_1, x_2)$. Then for $x \in I$,

$$E_x p^{+}(X(t^*(I)), J) = p^{+}(x_2, J)\, p^{+}(x, I) + p^{+}(x_1, J)\, p^{-}(x, I) = p^{+}(x, J),$$

$$E_x m(X(t^*(I)), J) = m(x_2, J)\, p^{+}(x, I) + m(x_1, J)\, p^{-}(x, I) = m(x, J) - m(x, I).$$

For $b$ a closed reflecting endpoint and $J = [b, c)$, even simpler identities
hold and 16.66 is again true for $J$ of this form.

To complete the recovery of $m$, $p^{+}$ from $U$, one needs to know that the
solutions of the equations in 16.66 are unique, subject to the conditions
$m$, $p^{+}$ continuous on $\bar{J} = [a, b]$, and $p^{+}(b-, J) = 1$, $p^{+}(a+, J) = 0$,
$m(a+, J) = 0$, $m(b-, J) = 0$.
Proposition 16.67. Let $J = (a, b)$. If $f(x)$ is continuous on $\bar{J}$, then

$$(Uf)(x) = 0,\ x \in J, \quad f(a) = f(b) = 0 \ \Rightarrow\ f(x) = 0,\ x \in J.$$

Proof. This is based on the following minimum principle for $U$:


Lemma 16.68. If $h(x)$ is a continuous function in some neighborhood of
$x_0 \in F$, if $h \in \mathcal{D}(U, \{x_0\})$, and if $h(x)$ has a local minimum at $x_0$, then

$$(Uh)(x_0) \ge 0.$$

This is obvious from the expression for $U$. Now suppose there is a function
$\varphi(x)$ continuous on $\bar{J}$ such that $\varphi > 0$ on $\bar{J}$ and $U\varphi = \lambda\varphi$ on $J$, $\lambda > 0$.
Suppose that $f(x)$ has a minimum in $J$, then for $\epsilon > 0$ sufficiently small,
$f - \epsilon\varphi$ has a minimum in $J$, but

$$U(f - \epsilon\varphi) = -\lambda\epsilon\varphi < 0 \quad \text{on } J.$$

By the minimum principle, $f$ cannot have a minimum. Similarly it cannot
have a maximum, thus $f \equiv 0$. All that is left to do is get $\varphi$. This we will do
shortly in Theorem 16.69 below.

This discussion does not establish the fact that $U$ determines $m(\{b\})$ at
a closed, reflecting endpoint $b$, because if $J = [b, c)$, then $m(b, J) \ne 0$. In
this case suppose there are two solutions, $f_1$, $f_2$ of $Uf = -1$, $x \in J$, such that
both are continuous on $\bar{J}$, and $f_1(c) = f_2(c) = 0$, $f_1(b) = \alpha$, $f_2(b) = \beta > \alpha$.
Form

$$f = f_2 - \frac{\beta}{\alpha}\, f_1, \qquad \text{so} \qquad Uf = \frac{\beta}{\alpha} - 1 > 0.$$

The minimum principle implies that $f(x) \le 0$ on $J$. But $(Uf)(b) > 0$ implies
there is a number $d$, $b < d < c$, such that $f(d) - f(b) > 0$. Hence $\beta = \alpha$,
$f_1(x) = f_2(x)$ on $J$.
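Theorem 16.69. $U\varphi = \lambda\varphi$, $x \in \operatorname{int}(F)$, $\lambda > 0$, has two continuous solutions
$\varphi_{+}(x)$, $\varphi_{-}(x)$ such that
i) $\varphi_{+}(x) > 0$, $\varphi_{-}(x) > 0$, $x \in \operatorname{int}(F)$,
ii) $\varphi_{+}(x)$ is strictly increasing, $\varphi_{-}(x)$ is strictly decreasing,
iii) at closed endpoints $b$, $\varphi_{+}$ and $\varphi_{-}$ have finite limits as $x \to b$, at open right
(left) endpoints $\varphi_{+}$ ($\varphi_{-}$) $\to \infty$,
iv) any other continuous solution of $U\varphi = \lambda\varphi$, $x \in \operatorname{int}(F)$ is a linear
combination of $\varphi_{+}$, $\varphi_{-}$.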


Proof. The idea is simple in the case $F = [a, b]$. Here we could take

$$\varphi_{+}(x) = E_x e^{-\lambda t^*_b}, \qquad \varphi_{-}(x) = E_x e^{-\lambda t^*_a},$$

and show that these two satisfy the theorem. In the general case, we have
to use extensions of these two functions. Denote for $x < z$,

$$\theta_{+}(x, z) = E_x e^{-\lambda t^*_z}, \qquad \theta_{-}(z, x) = E_z e^{-\lambda t^*_x}.$$

For $x < y < z$, the time from $x$ to $z$ consists of the time from $x$ to $y$ plus
the time from $y$ to $z$. Use the strong Markov property to get

$$(16.70)\qquad \theta_{+}(x, z) = \theta_{+}(x, y)\, \theta_{+}(y, z), \qquad \theta_{-}(z, x) = \theta_{-}(z, y)\, \theta_{-}(y, x).$$

Now, in general, $t^*_z$ and $t^*_x$ are extended random variables and therefore not
stopping times. But, by truncating them and then passing to the limit,
identities such as (16.70) can be shown to hold. Pick $z_0$ in $F$ to be the closed
right endpoint if there is one, otherwise arbitrary. Define, for $x \le z_0$,

$$\varphi_{+}(x) = E_x e^{-\lambda t^*_{z_0}}.$$

Now take $z_1 > z_0$ and for $x \le z_1$ define

$$\varphi_{+}(x) = E_x e^{-\lambda t^*_{z_1}} \big/ E_{z_0} e^{-\lambda t^*_{z_1}}.$$

Use (16.70) to check that the $\varphi_{+}$ as defined for $x \le z_1$ is an extension of $\varphi_{+}$
defined for $x \le z_0$.
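
For a concrete instance of this construction, the sketch below takes standard Brownian motion on the whole line. It relies on the classical closed form $E_x e^{-\lambda t^*_z} = e^{-\sqrt{2\lambda}\,(z - x)}$ for $x \le z$, which is assumed here and not derived in the text; the function names and the points $z_0 = 0$, $z_1 = 1$ are illustrative. With it, the multiplicative identity (16.70), the consistency of the extension step, and the equation $\tfrac{1}{2}\varphi'' = \lambda\varphi$ (the form of $U\varphi = \lambda\varphi$ when $m(dx) = dx$) can all be checked numerically.

```python
import math

def theta_plus(x, z, lam):
    """E_x exp(-lam * t*_z), x <= z, for standard Brownian motion (assumed closed form)."""
    return math.exp(-math.sqrt(2.0 * lam) * (z - x))

def phi_plus(x, lam, z0=0.0, z1=1.0):
    """phi_+ as built in the text: theta_+(x, z0) for x <= z0, extended to x <= z1 by a ratio."""
    if x <= z0:
        return theta_plus(x, z0, lam)
    return theta_plus(x, z1, lam) / theta_plus(z0, z1, lam)

lam = 0.7
# Identity (16.70) with x = -1 < y = 0.5 < z = 2.
lhs = theta_plus(-1.0, 2.0, lam)
rhs = theta_plus(-1.0, 0.5, lam) * theta_plus(0.5, 2.0, lam)

# Second difference of phi_+ at an interior point, to compare (1/2) phi'' with lam * phi.
x, h = 0.5, 1e-3
half_second_derivative = (phi_plus(x + h, lam) - 2.0 * phi_plus(x, lam) + phi_plus(x - h, lam)) / (2.0 * h * h)
```

In this example $\varphi_{+}(x) = e^{\sqrt{2\lambda}\,(x - z_0)}$, which is positive, strictly increasing, and unbounded as $x \to +\infty$, in line with (i) through (iii) of Theorem 16.69 for an open right endpoint.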

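Continuing this way we define $\varphi_{+}(x)$ on $F$. Note that $\varphi_{+} > 0$ on $\operatorname{int}(F)$,
and is strictly increasing. Now define $\varphi_{-}$ in an analogous way. As $w \uparrow y$
or $w \downarrow y$, $t^*_w \to t^*_y$ a.s. $P_x$ on the set $\{t^*_y < \infty\}$. Use this fact applied to (16.70)
to conclude that $\varphi_{+}$ and $\varphi_{-}$ are continuous. If the right-hand endpoint $b$ is
open, then $\lim E_x \exp[-\lambda t^*_z] = 0$ as $z \to b$, otherwise $b$ is accessible. For
$z > z_0$, the definition of $\varphi_{+}$ gives

$$\varphi_{+}(z) = 1 \big/ E_{z_0} e^{-\lambda t^*_z},$$

so $\varphi_{+}(z) \to +\infty$ as $z \to b$.
For $y$ arbitrary in $F$, let $\varphi(x) = E_x e^{-\lambda t^*_y}$, then assert that


Proposition 16.71. $\varphi \in \mathcal{D}(U, F - \{y\})$, and for $x \ne y$,

$$U\varphi = \lambda\varphi.$$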


Proof. Take $h(x) \in C$ such that $h(x) = 0$ for all $x \le y$. Then $f = R_\lambda h$ is in
$\mathcal{D}(S, C)$. By truncating, it is easy to show that the identity (16.61) holds for
the extended stopping variable $t^*_y$. For $x \le y$, (16.61) now becomes

$$E_x e^{-\lambda t^*_y} f(X(t^*_y)) = f(x)$$

or

$$f(y)\, \varphi(x) = f(x), \qquad x \le y.$$

By 15.51, $(\lambda - S) f = h$, so

$$(Uf)(x) = \lambda f(x), \qquad x < y.$$

But for $x < y$,

$$(Uf)(x) = f(y)\, (U\varphi)(x),$$

which leads to $U\varphi = \lambda\varphi$. An obviously similar argument does the trick for
$x > y$.

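Therefore $\varphi_{+}$, $\varphi_{-}$ are solutions of $U\varphi = \lambda\varphi$ in $\operatorname{int}(F)$. Let $\varphi$ be any
solution of $U\varphi = \lambda\varphi$ in $\operatorname{int}(F)$. For $x_1 < x_2$ determine constants $c_1$, $c_2$ so
that

$$\varphi(x_2) = c_1 \varphi_{+}(x_2) + c_2 \varphi_{-}(x_2),$$
$$\varphi(x_1) = c_1 \varphi_{+}(x_1) + c_2 \varphi_{-}(x_1).$$

This can always be done if $\varphi_{+}(x_2)\varphi_{-}(x_1) - \varphi_{-}(x_2)\varphi_{+}(x_1) \ne 0$. The function
$D(x) = \varphi_{+}(x)\varphi_{-}(x_1) - \varphi_{-}(x)\varphi_{+}(x_1)$ is strictly increasing in $x$, so $D(x_1) =
0 \Rightarrow D(x_2) > 0$. The function $\tilde{\varphi}(x) = \varphi(x) - c_1\varphi_{+}(x) - c_2\varphi_{-}(x)$ satisfies
$U\tilde{\varphi} = \lambda\tilde{\varphi}$, and $\tilde{\varphi}(x_1) = \tilde{\varphi}(x_2) = 0$. The minimum principle implies $\tilde{\varphi}(x) = 0$,
$x_1 \le x \le x_2$. Thus

$$\varphi(x) = c_1\varphi_{+}(x) + c_2\varphi_{-}(x), \qquad x_1 \le x \le x_2.$$

This must hold over $\operatorname{int}(F)$. For if $c_1'$, $c_2'$ are constants determined by a larger
interval $(x_1', x_2')$, then

$$(c_1' - c_1)\varphi_{+}(x) + (c_2' - c_2)\varphi_{-}(x) = 0, \qquad x_1 \le x \le x_2.$$

This is impossible unless $\varphi_{+}$, $\varphi_{-}$ are constant over $(x_1, x_2)$. Hence $c_1' = c_1$,
$c_2' = c_2$.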


We can now prove the uniqueness. 


Theorem 16.72. For $f(x)$ bounded and continuous on $F$, $g(x) = (R_\lambda f)(x)$,
$\lambda > 0$, is in $\mathcal{D}(U, F)$ and is the unique bounded continuous solution of

$$(Ug)(x) = \lambda g(x) - f(x), \qquad \text{all } x.$$

Proof. That $g(x) \in \mathcal{D}(U, F)$ follows from 16.63 and 15.51. Further, since
$(\lambda - S) R_\lambda f = f$ by 15.51,

$$Sg = \lambda g - f \ \Rightarrow\ Ug = \lambda g - f.$$

Suppose that there were two bounded continuous solutions $g_1$, $g_2$ on $F$.
Then $\varphi(x) = g_1(x) - g_2(x)$ satisfies

$$U\varphi = \lambda\varphi, \qquad x \in F.$$

By 16.69,

$$\varphi(x) = c_1\varphi_{+}(x) + c_2\varphi_{-}(x), \qquad x \in \operatorname{int}(F).$$

If $F$ has two open endpoints, $\varphi$ cannot be bounded unless $c_1 = c_2 = 0$.
Therefore, take $F$ to have at least one closed endpoint $b$, say to the left.
Assume that $\varphi(b) \ge 0$; otherwise, use the solution $-\varphi(x)$. If $\varphi(b) > 0$,
then

$$(U\varphi)(b) > 0 \ \Rightarrow\ \varphi(b + h) > \varphi(b)$$

for all $h$ sufficiently small. Then $\varphi(x)$ can never decrease anywhere, because
if it did, it would have a positive maximum in $\operatorname{int}(F)$, contradicting the
minimum principle. If $F = [b, c)$, then $\varphi(x) = c_2\varphi_{-}(x)$. Since $\varphi_{-}(b) = 1$,
the case $\varphi(b) = 0$ leads to $\varphi(x) \equiv 0$. But $\varphi(b) > 0$ is impossible because
$\varphi_{-}(x)$ is decreasing. Finally, look at $F = [b, c]$. If $\varphi(c) > 0$ or if $\varphi(c) < 0$,
then an argument similar to the above establishes that $\varphi(x)$ must be decreasing
on $F$ or increasing on $F$, respectively. In either case, $\varphi(b) > 0$ is impossible.
The only case not ruled out is $\varphi(b) = \varphi(c) = 0$. Here the minimum principle
gives $\varphi(x) \equiv 0$ and the theorem follows.


Corollary 16.73. There is exactly one Feller process having a given scale 
function and speed measure. 


Proof. It is necessary only to show that the transition probabilities $\{p_t(B \mid x)\}$
are uniquely determined by $u(x)$, $m(dx)$. This will be true if $E_x f(X(t))$ is
uniquely determined for all bounded continuous $f$ on $F$. The argument used
to prove uniqueness in 15.51 applies here to show that $E_x f(X(t))$ is completely
determined by the values of $(R_\lambda f)(x)$, $\lambda > 0$. But $g(x, \lambda) = (R_\lambda f)(x)$
is the unique bounded continuous solution of $Ug = \lambda g + f$, $x \in F$.


11. $\varphi_+(x)$ AND $\varphi_-(x)$


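These functions (which depend also on $\lambda$) have a central role in further
analytic developments of the theory of Feller processes.

For example, let $J$ be any finite interval, open in $F$, with endpoints $x_1$, $x_2$.
The first passage time distributions from $J$ can be specified by the two
functions

$$\theta^+(\lambda, x) = \int_{A_+} e^{-\lambda t^*(J)}\, dP_x, \qquad \theta^-(\lambda, x) = \int_{A_-} e^{-\lambda t^*(J)}\, dP_x,$$

where $A_+ = \{X(t^*(J)) = x_2\}$, $A_-$ is the other exit set.

Theorem 16.74 (Darling and Siegert)

$$\theta^+(\lambda, x) = \frac{\varphi_+(x)\varphi_-(x_1) - \varphi_-(x)\varphi_+(x_1)}{\varphi_+(x_2)\varphi_-(x_1) - \varphi_-(x_2)\varphi_+(x_1)},$$

$$\theta^-(\lambda, x) = \frac{\varphi_+(x)\varphi_-(x_2) - \varphi_-(x)\varphi_+(x_2)}{\varphi_+(x_1)\varphi_-(x_2) - \varphi_-(x_1)\varphi_+(x_2)}.$$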
Remark. Since these expressions are invariant under linear transformations,
we could just as well take instead of $\varphi_+$, $\varphi_-$ any two linearly independent
solutions $g_+$, $g_-$ of $Ug = \lambda g$ such that $g_+ \uparrow$, $g_- \downarrow$.


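Proof. Use the strong Markov property to write

$$E_x e^{-\lambda t_{x_2}^*} = \int_{A_+} e^{-\lambda t^*(J)}\, dP_x + \left(E_{x_1} e^{-\lambda t_{x_2}^*}\right) \int_{A_-} e^{-\lambda t^*(J)}\, dP_x,$$

$$E_x e^{-\lambda t_{x_1}^*} = \left(E_{x_2} e^{-\lambda t_{x_1}^*}\right) \int_{A_+} e^{-\lambda t^*(J)}\, dP_x + \int_{A_-} e^{-\lambda t^*(J)}\, dP_x.$$

Let $h^+(x) = E_x e^{-\lambda t_{x_2}^*}$, $h^-(x) = E_x e^{-\lambda t_{x_1}^*}$, and solve the above equations to get

$$\theta^+(\lambda, x) = \frac{h^+(x) - h^-(x)h^+(x_1)}{1 - h^-(x_2)h^+(x_1)}.$$

Since $h^-(x_1) = 1$, $h^+(x_2) = 1$, this has the form of the theorem. By construction
$h^-(x)$, $h^+(x)$ are constant multiples of $\varphi_-(x)$, $\varphi_+(x)$.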
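
As a sanity check on Theorem 16.74, the formula for $\theta^+$ can be evaluated in one concrete case. The sketch below assumes standard Brownian motion, for which (an assumption brought in from outside this section) $U\varphi = \frac{1}{2}\varphi''$, so that $\varphi_\pm(x) = e^{\pm\sqrt{2\lambda}\,x}$ are increasing and decreasing solutions of $U\varphi = \lambda\varphi$. The function names are illustrative only. Under that assumption the ratio should collapse to $\sinh(\sqrt{2\lambda}(x - x_1))/\sinh(\sqrt{2\lambda}(x_2 - x_1))$.

```python
import math

def theta_plus(lam, x, x1, x2):
    """Darling-Siegert ratio for theta^+ using the Brownian-motion phi_+, phi_-."""
    r = math.sqrt(2.0 * lam)
    phi_p = lambda z: math.exp(r * z)    # increasing solution (assumed form)
    phi_m = lambda z: math.exp(-r * z)   # decreasing solution (assumed form)
    num = phi_p(x) * phi_m(x1) - phi_m(x) * phi_p(x1)
    den = phi_p(x2) * phi_m(x1) - phi_m(x2) * phi_p(x1)
    return num / den

def theta_plus_closed_form(lam, x, x1, x2):
    """The same quantity written with hyperbolic sines."""
    r = math.sqrt(2.0 * lam)
    return math.sinh(r * (x - x1)) / math.sinh(r * (x2 - x1))
```

The boundary values behave as they should: the ratio equals 1 at $x = x_2$ and 0 at $x = x_1$.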


More important, in determining the existence of densities of the transition 
probabilities p,(B| x) and their differentiability is the following sequence 
of statements: 


Theorem 16.75 


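i) $\frac{d}{dx}\varphi_+(x)$, $\frac{d}{dx}\varphi_-(x)$ both exist except possibly at a countable number of
points.

ii) $\varphi_-\frac{d\varphi_+}{dx} - \varphi_+\frac{d\varphi_-}{dx}$ is constant except possibly where $\frac{d\varphi_+}{dx}$
or $\frac{d\varphi_-}{dx}$ do not exist.

iii) $R_\lambda(dy \mid x) \ll m(dy)$ on $\mathcal{B}(F)$, and putting $r_\lambda(y, x) = \frac{dR_\lambda}{dm}$, we have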


1 yo, yx, 
3 
riy >, x) = 


5 9-0 yx, 


or equivalently,
$$(R_\lambda f)(x) = \int f(y)\, r_\lambda(y, x)\, m(dy)$$

for $r_\lambda(y, x)$ as above.
The proof, which I will not give (see Ito and McKean, pp. 149 ff.), goes like
this: Show that $U\varphi_+$, $U\varphi_-$ have the differential form given in 16.64. Then
statement (ii) comes from the fact that

$$\int_J (\varphi_- U\varphi_+ - \varphi_+ U\varphi_-)\, dm = 0.$$

Statement (iii) comes from showing directly that the function

$$g(x) = \int f(y)\, r_\lambda(y, x)\, m(dy)$$

satisfies $Ug = \lambda g + f$ for all continuous bounded $f$.

The statement 16.75(iii) together with some eigenvalue expansions
finally shows that $p_t(dy \mid x) \ll m(dy)$ and that $p_t(y \mid x)$, the density with
respect to $m(dy)$, exists and is a symmetric function of $y$, $x$. Also, the same
development shows that $p_t(y \mid x) \in \mathcal{D}(S, C)$ and that

$$\frac{\partial}{\partial t}\, p_t(y \mid x) = S\, p_t(y \mid x).$$

Unfortunately, the proofs of these results require considerably more analytic
work. It is disturbing that such basic things as existence of densities and
the proofs that $p_t(y \mid x)$ are sufficiently differentiable to satisfy the backwards
equations lie so deep.


Problem 18. If a regular Feller process has a stationary initial distribution
$\pi(dx)$ on $F$, then show that $\pi(dx)$ must be a constant multiple of the speed
measure $m(dx)$. [Use 16.75(iii).]


12. DIFFUSIONS 


There is some disagreement over what to call a diffusion. We more or less 
follow Dynkin [44]. 


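Definition 16.76. A diffusion is a Feller process on $F$ such that there exist
functions $\sigma^2(x)$, $\mu(x)$ defined and continuous on int($F$), $\sigma^2(x) > 0$ on int($F$),
with

i) $\frac{1}{t} P_x(|X(t) - x| > \varepsilon) \xrightarrow{bp} 0$,

ii) $\frac{1}{t} \int_{\{|X(t) - x| \le \varepsilon\}} (X(t) - x)\, dP_x \xrightarrow{bp} \mu(x)$,

iii) $\frac{1}{t} \int_{\{|X(t) - x| \le \varepsilon\}} (X(t) - x)^2\, dP_x \xrightarrow{bp} \sigma^2(x)$,

where the convergence ($\xrightarrow{bp}$) is bounded pointwise on all finite intervals $J$,
$\bar{J} \subset$ int($F$).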
So a diffusion is pretty much what we started in with: a process that is
locally Brownian. Note

Proposition 16.77. Let $f(x)$ have a continuous second derivative on int($F$).
For a diffusion, $f \in \mathcal{D}(U, \text{int}(F))$ and

$$(Uf)(x) = \tfrac{1}{2}\sigma^2(x) f''(x) + \mu(x) f'(x).$$

Proof. Define $\tilde{f}(x) = f(x)$ in some neighborhood $J$ of $x$ such that $\tilde{f}(x)$ is
bounded and has continuous second derivative on int($F$) and vanishes
outside a compact interval $I$ contained in the interior of $F$. On $I^c$ prove that
$(S_t \tilde{f})(x) \xrightarrow{bp} 0$. Apply a Taylor expansion now to conclude that $\tilde{f} \in \mathcal{D}(S, C)$,
and that $(S\tilde{f})(x)$ is given by the right-hand side above.
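
The formula in 16.77 can be checked by hand in a case where every expectation is available in closed form. The sketch below assumes constant coefficients, so that $X(t) = x + \mu t + \sigma B(t)$ with $B$ standard Brownian motion, and takes $f(x) = x^3$; both choices are illustrative assumptions, and the expectation used is the untruncated one rather than the truncated moments of Definition 16.76. For this Gaussian case $E_x f(X(t)) = m^3 + 3m\sigma^2 t$ with $m = x + \mu t$, so the difference quotient can be compared directly with $\frac{1}{2}\sigma^2 f'' + \mu f'$.

```python
def expected_cube(x, t, mu, sigma):
    """E (x + mu*t + sigma*B(t))**3 for standard Brownian motion B."""
    m = x + mu * t
    return m**3 + 3.0 * m * sigma**2 * t

def difference_quotient(x, t, mu, sigma):
    """(E_x f(X(t)) - f(x)) / t for f(x) = x**3."""
    return (expected_cube(x, t, mu, sigma) - x**3) / t

def generator_on_cube(x, mu, sigma):
    """(1/2) sigma^2 f''(x) + mu f'(x) for f(x) = x**3."""
    return 0.5 * sigma**2 * (6.0 * x) + mu * (3.0 * x**2)
```

For small $t$ the two quantities agree up to a term of order $t$, which is what the Taylor expansion in the proof predicts.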


The scale is given in terms of $\mu$, $\sigma$ by

Proposition 16.78. For a diffusion, the scale function $u(x)$ is the unique (up to a
linear transformation) solution on int($F$) of

$$\text{(16.79)} \qquad \tfrac{1}{2}\sigma^2(x)\frac{d^2 u(x)}{dx^2} + \mu(x)\frac{du(x)}{dx} = 0.$$

Proof. Equation (16.79) has the solution

$$\frac{du_0(x)}{dx} = \exp\left[-\int_{x_0}^{x} \frac{2\mu(z)}{\sigma^2(z)}\, dz\right]$$

for an arbitrary $x_0$. This $u_0(x)$ has continuous second derivative on int($F$).
Thus by 16.77 $(Uu_0)(x) = 0$, $x \in$ int($F$), and $u_0(x)$ is a scale function.
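
The solution formula in the proof of 16.78 lends itself to a quick numerical check. The sketch below is an illustration only: it approximates the integral by the trapezoid rule, takes $x_0 = 0$ by default, and the helper names are hypothetical. It returns $u_0'(x)$ for user-supplied functions `mu` and `sigma2` (standing for $\mu$ and $\sigma^2$) and then measures how far $\frac{1}{2}\sigma^2 u_0'' + \mu u_0'$ is from zero using a central difference.

```python
import math

def scale_derivative(x, mu, sigma2, x0=0.0, steps=2000):
    """u0'(x) = exp(-integral from x0 to x of 2*mu(z)/sigma2(z) dz), trapezoid rule."""
    h = (x - x0) / steps
    total = 0.0
    for i in range(steps + 1):
        z = x0 + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * 2.0 * mu(z) / sigma2(z)
    return math.exp(-total * h)

def ode_residual(x, mu, sigma2, d=1e-4):
    """(1/2) sigma^2 u0'' + mu u0' at x, with u0'' from a central difference of u0'."""
    up = scale_derivative(x, mu, sigma2)
    upp = (scale_derivative(x + d, mu, sigma2)
           - scale_derivative(x - d, mu, sigma2)) / (2.0 * d)
    return 0.5 * sigma2(x) * upp + mu(x) * up
```

For example, with $\mu(x) = -x$ and $\sigma^2(x) = 1$ the formula gives $u_0'(x) = e^{x^2}$, and the residual is zero up to discretization error.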

Now that the scale is determined, transform to natural scale, getting the
process

$$\text{(16.80)} \qquad \tilde{X}(t) = u(X(t)).$$

Proposition 16.81. $\tilde{X}(t)$ is a diffusion with zero drift and
$$\tilde\sigma^2(y) = \sigma^2(x)(u'(x))^2, \qquad y = u(x).$$

Proof. Proposition 16.81 follows from

Proposition 16.82. Let $X(t)$ be a diffusion on $F$, $w(x)$ a function continuous on
$F$ such that $|w'(x)| \ne 0$ on int($F$), $w''(x)$ continuous on int($F$). Then $\tilde{X}(t) = w(X(t))$
is a diffusion. If $\tilde{x} = w(x)$,

$$\tilde\mu(\tilde{x}) = \tfrac{1}{2}\sigma^2(x)w''(x) + \mu(x)w'(x),$$

$$\tilde\sigma^2(\tilde{x}) = \sigma^2(x)(w'(x))^2.$$
Proof. The transformations of the drift and variance here come from

$$\frac{E_x w(X(t)) - w(x)}{t} \to Sw(x),$$

$$\frac{E_x\left(w(X(t)) - w(x)\right)^2}{t} = \frac{E_x w^2(X(t)) - w^2(x)}{t} - 2w(x)\,\frac{E_x w(X(t)) - w(x)}{t} \to Sw^2 - 2w \cdot Sw.$$

What needs to be shown is that for the revised process, the bounded pointwise
convergence to $\tilde\mu(\tilde{x})$, $\tilde\sigma^2(\tilde{x})$
takes place on finite intervals in the interior of $F$. This is a straightforward
verification, and I omit it.


By this time it has become fairly apparent from various bits of evidence 
that the following should hold: 


Theorem 16.83. For a diffusion with zero drift the speed measure is given on
$\mathcal{B}(\text{int}\, F)$ by
$$m(dx) = \frac{1}{\sigma^2(x)}\, dx.$$
Proof. Take $f$ to be zero off a compact interval $I \subset$ int($F$), with a continuous
second derivative. There are two expressions, given by 16.77 and 16.64, for
$(Uf)(x)$. Equating these,
$$\sigma^2(x)\frac{d^2 f(x)}{dx^2} = \left(\frac{d}{dm}\frac{df}{dx}\right)(x),$$
or,

$$f'(x_2) - f'(x_1) = \int_{(x_1, x_2)} \sigma^2(y) f''(y)\, m(dy) = \int_{(x_1, x_2)} f''(y)\, dy,$$

which implies the theorem.


These results show that $\mu(x)$, $\sigma^2(x)$ determine the scale function and
the speed measure completely on int(F). Hence, specifying m on any and 
all regular boundary points completely specifies the process. The basic 
uniqueness result 16.73 guarantees at most one Feller process with the given 
scale and speed. Section 8 gives a construction for the associated Feller 
process. What remains is to show that the process constructed is a diffusion. 


Theorem 16.84. Let $m(dx) = V(x)\, dx$ on int($F$), $V(x)$ continuous and positive
on int($F$). The process on natural scale with this speed measure is a diffusion
with $\mu(x) = 0$, $\sigma^2(x) = 1/V(x)$.

Proof. Use the representation $X(t) = X_0(T(t))$, where $T(t)$ is the random
time change of Section 8. The first step in the proof is: Let the interval
$J_\varepsilon = (x - \varepsilon, x + \varepsilon)$, take $J \subset$ int($F$) finite such that for any $x \in J$,
$J_\varepsilon \subset I \subset$ int($F$). Then

i°) $\frac{1}{t} P_x(t^*(J_\varepsilon) < t) \xrightarrow{bp} 0$, $x \in J$.

Proof of (i°). $\{t^*(J_\varepsilon) < t\} = \{t_0^*(J_\varepsilon) < T(t)\}$. In the latter set, by taking
inverses we get

$$\{t_0^*(J_\varepsilon) < T(t)\} = \left\{\int_0^{t_0^*(J_\varepsilon)} V(X_0(\xi))\, d\xi < t\right\}.$$

But letting $M = \inf_{y \in I} V(y)$, we find that, denoting $J_\varepsilon^0 = (-\varepsilon, \varepsilon)$,

$$P_x\left(\int_0^{t_0^*(J_\varepsilon)} V(X_0(\xi))\, d\xi < t\right) \le P_x(M t_0^*(J_\varepsilon) < t) = P_0(t_0^*(J_\varepsilon^0) < t/M).$$

Since $P_x(|X(t) - x| > \varepsilon) \le P_x(t^*(J_\varepsilon) < t)$, condition (i) of the diffusion
definition 16.76 is satisfied.
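To do (ii) and (iii): By (i°) it is sufficient to prove

$$\frac{1}{t} \int_{\{t < t^*(J_\varepsilon)\}} (X(t) - x)\, dP_x \xrightarrow{bp} 0, \qquad \frac{1}{t} \int_{\{t < t^*(J_\varepsilon)\}} (X(t) - x)^2\, dP_x \xrightarrow{bp} \frac{1}{V(x)},$$

for $x$ in finite $J$, $\bar{J} \subset$ int($F$).
Let $\tau_\varepsilon^* = \min(t, t^*(J_\varepsilon))$. Then we will show that

ii°) $E_x X(\tau_\varepsilon^*) = x$, $\quad E_x X^2(\tau_\varepsilon^*) = x^2 + E_x T(\tau_\varepsilon^*)$.

Proof of (ii°). Use (16.52); that is (dropping the $\varepsilon$),
$$t_0^*(J) = T(t^*(J)),$$
$$X(\tau^*) = X_0(T(\tau^*)) = X_0(\min(T(t), t_0^*(J))).$$

But $\tau_0^* = \min(T(t), t_0^*(J))$ is a stopping time for Brownian motion. Since
$X_0(t)$, $X_0^2(t) - t$ are martingales, (ii°) follows.

Use (ii°) as follows:

$$\int_{\{t < t^*(J_\varepsilon)\}} X(t)\, dP_x = \int_{\{t < t^*(J_\varepsilon)\}} X(\tau_\varepsilon^*)\, dP_x = x - \int_{\{t \ge t^*(J_\varepsilon)\}} X(\tau_\varepsilon^*)\, dP_x.$$

Hence, we get the identity

$$\int_{\{t < t^*(J_\varepsilon)\}} (X(t) - x)\, dP_x = -\int_{\{t \ge t^*(J_\varepsilon)\}} (X(\tau_\varepsilon^*) - x)\, dP_x.$$

This latter integral is in absolute value less than $\varepsilon P_x(t^*(J_\varepsilon) \le t)$. Apply
(i°) now.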
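For (iii), write (with $\tau_\varepsilon^* = \min(t, t^*(J_\varepsilon))$ as before)

$$\int_{\{t < t^*(J_\varepsilon)\}} (X(t) - x)^2\, dP_x = \int_{\{t < t^*(J_\varepsilon)\}} (X(\tau_\varepsilon^*) - x)^2\, dP_x$$

$$= E_x(X(\tau_\varepsilon^*) - x)^2 - \int_{\{t \ge t^*(J_\varepsilon)\}} (X(\tau_\varepsilon^*) - x)^2\, dP_x.$$

By (ii°),
$$E_x(X(\tau_\varepsilon^*) - x)^2 = E_x T(\tau_\varepsilon^*).$$

The remaining integral above is bounded by $\varepsilon^2 P_x(t^*(J_\varepsilon) \le t)$, hence gives
no trouble. To evaluate $E_x T(\tau_\varepsilon^*)$, write

$$t \ge \int_0^{T(\tau_\varepsilon^*)} V(X_0(\xi))\, d\xi \ge M\, T(\tau_\varepsilon^*),$$

so that
$$\frac{T(\tau_\varepsilon^*)}{t} \le \frac{1}{M}$$
for all $t$ and $x \in J$. Further, since $T(t) \to 0$ as $t \to 0$,

$$\frac{t}{T(t)} = \frac{1}{T(t)} \int_0^{T(t)} V(X_0(\xi))\, d\xi \to V(X_0(0)) \quad \text{a.s.}$$

as $t \to 0$. Since $T(\tau_\varepsilon^*)/t$ is bounded by $1/M$, the bounded convergence theorem
can be applied to get
$$\lim_{t \to 0} \frac{1}{t} E_x T(\tau_\varepsilon^*) = \frac{1}{V(x)}$$

for every $x \in J$.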


These results help us to characterize diffusions. For example, now 
we know that of all Feller processes on natural scale, the diffusion processes 
are those such that dm « dx on int(F) and dm/dx has a continuous version, 
positive on int(F). A nonscaled diffusion has the same speed measure 
property, plus a scale function with nonvanishing first derivative and con- 
tinuous second derivative. 


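Problem 19. Show that for a diffusion, the functions $\varphi_+$, $\varphi_-$ of Section 10
are solutions in int($F$) of

$$\tfrac{1}{2}\sigma^2(x)\frac{d^2\varphi}{dx^2} + \mu(x)\frac{d\varphi}{dx} = \lambda\varphi.$$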
NOTES


Most of the material in this chapter was gleaned from two recent books on 
this subject, Dynkin [44], of which an excellent English translation was 
published in 1965, and Ito, McKean [76, 1965]. 

The history of the subject matter is very recent. The starting point was 
a series of papers by Feller, beginning in 1952 [57], which used the semigroup 
approach. See also [58]. Dynkin introduced the idea of the characteristic 
operator in [43, 1955], and subsequently developed the theory using the 
associated concepts. The idea of random time substitutions was first exploited 
in this context by Volkonskii [138, 1958]. The construction of the general 
process using local time was completed by Ito and McKean [76].

The material as it stands now leaves one a little unhappy from the 
pedagogic point of view. Some hoped-for developments would be: (1) a 
simple proof of the local time theorem (David Freedman has shown me a 
simple proof of all the results of the theorem excepting the continuity of 
L*(t, y)); (2) a direct proof of the unique determination of the process by 
scale function and speed measure to replace the present detour by means of 
the characteristic operator; (3) a simplification of the proofs of the existence 
of densities and smoothness properties for the transition probabilities. 

Charles Stone has a method of getting Feller processes as limits of birth 
and death processes which seems to be a considerable simplification both 
conceptually and mathematically. Most of this is unpublished. However, 
see [131] for a bit of it. 

The situation in two or more dimensions is a wilderness. The essential 
property in one dimension that does not generalize is that if a path-continuous 
process goes from x to y, then it has to pass through all the points between 
$x$ and $y$. So far, the most powerful method for dealing with diffusions in
any number of dimensions is the use of stochastic integral equations (see
Doob [39, Chap. VI], Dynkin [44, Chap. XI]) initiated by Ito. The idea
here is to attempt a direct integration to solve the equation

$$\Delta Y = \mu(Y)\,\Delta t + \sigma(Y)\,\Delta X$$


and its multidimensional analogs. 
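
To make the increment equation concrete, here is a minimal sketch of the simplest discretization one might try, the Euler scheme, in which $\Delta X$ is replaced by independent Gaussian increments of variance $\Delta t$. This is an illustration under stated assumptions and not Ito's construction of the stochastic integral; the function names, the seed and the step size are all hypothetical choices.

```python
import random

def euler_path(y0, mu, sigma, dt, increments):
    """Iterate Y <- Y + mu(Y)*dt + sigma(Y)*dX over the supplied increments dX."""
    path = [y0]
    for dx in increments:
        y = path[-1]
        path.append(y + mu(y) * dt + sigma(y) * dx)
    return path

def brownian_increments(n, dt, seed=0):
    """n independent N(0, dt) increments, standing in for Delta X."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, dt ** 0.5) for _ in range(n)]
```

A call such as `euler_path(1.0, lambda y: -y, lambda y: 1.0, 0.01, brownian_increments(100, 0.01))` produces one approximate sample path; with `sigma` identically zero the recursion reduces to Euler's method for the ordinary differential equation $dY = \mu(Y)\,dt$.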


APPENDIX 


ON MEASURE AND FUNCTION THEORY 


The purpose of this appendix is to give a brief review, with very few proofs, 
of some of the basic theorems concerning measure and function theory. 
We refer for the proofs to Halmos [64] by page number. 


1. MEASURES AND THE EXTENSION THEOREM 


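For $\Omega$ a set of points $\omega$, define

Definition A.1. A class $\mathcal{F}$ of subsets of $\Omega$ is a field if $A, B \in \mathcal{F}$ implies $A^c$,
$A \cup B$, $A \cap B$ are in $\mathcal{F}$. The class $\mathcal{F}$ is a $\sigma$-field if it is a field, and if, in addition,
$A_n \in \mathcal{F}$, $n = 1, 2, \ldots$ implies $\bigcup_1^\infty A_n \in \mathcal{F}$.

Notation A.2. We will use $A^c$ for the complement of $A$, $A - B$ for $A \cap B^c$,
$\emptyset$ for the empty set, $A \,\Delta\, B$ for the symmetric set difference

$$(A - B) \cup (B - A).$$
Note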


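Proposition A.3. For any class $\mathcal{C}$ of subsets of $\Omega$, there is a smallest field of
subsets, denoted $\mathcal{F}_0(\mathcal{C})$, and a smallest $\sigma$-field of subsets, denoted $\mathcal{F}(\mathcal{C})$,
containing all the sets in $\mathcal{C}$.

Proof. The class of all subsets of $\Omega$ is a $\sigma$-field containing $\mathcal{C}$. Let $\mathcal{F}(\mathcal{C})$ be the
class of sets $A$ such that $A$ is in every $\sigma$-field that contains $\mathcal{C}$. Check that
$\mathcal{F}(\mathcal{C})$ so defined is a $\sigma$-field and that if $\mathcal{F}$ is a $\sigma$-field, $\mathcal{C} \subset \mathcal{F}$, then $\mathcal{F}(\mathcal{C}) \subset \mathcal{F}$.
For fields a finite construction will give $\mathcal{F}_0(\mathcal{C})$.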
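
On a finite space the smallest field containing a class can be computed by brute force, which may help make A.3 concrete. The sketch below is one possible closure procedure and not the "finite construction" the text has in mind; the function name is hypothetical. It repeatedly adds complements and pairwise unions until nothing new appears; closure under intersection then follows from De Morgan's laws. On a finite $\Omega$ every field is also a $\sigma$-field, so the result is the generated $\sigma$-field as well.

```python
from itertools import combinations

def generated_field(omega, sets):
    """Smallest class containing `sets`, closed under complement and union (finite omega)."""
    omega = frozenset(omega)
    field = {frozenset(), omega} | {frozenset(s) for s in sets}
    changed = True
    while changed:
        changed = False
        current = list(field)
        for a in current:
            complement = omega - a
            if complement not in field:
                field.add(complement)
                changed = True
        for a, b in combinations(current, 2):
            union = a | b
            if union not in field:
                field.add(union)
                changed = True
    return field
```

For $\Omega = \{1, 2, 3, 4\}$ and the class $\{\{1\}, \{2\}\}$ the atoms are $\{1\}$, $\{2\}$, $\{3, 4\}$, so the generated field has $2^3 = 8$ members.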


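If $A_n$ is a sequence of sets such that $A_n \subset A_{n+1}$, $n = 1, 2, \ldots$ and $A = \bigcup A_n$,
write $A_n \uparrow A$. Similarly, if $A_{n+1} \subset A_n$, $A = \bigcap A_n$, write $A_n \downarrow A$.
Define a monotone class of subsets $\mathcal{C}$ by: If $A_n \in \mathcal{C}$, and $A_n \uparrow A$ or $A_n \downarrow A$,
then $A \in \mathcal{C}$.

Monotone Class Theorem A.4 (Halmos, p. 27). The smallest monotone class
of sets containing a field $\mathcal{F}_0$ is $\mathcal{F}(\mathcal{F}_0)$.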
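Definition A.5. A finitely additive measure $\mu$ on a field $\mathcal{F}$ is a real-valued
(including $+\infty$), nonnegative function with domain $\mathcal{F}$ such that for $A, B \in \mathcal{F}$,
$A \cap B = \emptyset$,

$$\mu(A \cup B) = \mu(A) + \mu(B).$$

This extends to: If $A_1, \ldots, A_n \in \mathcal{F}$ are pairwise disjoint, $A_i \cap A_j = \emptyset$,
$i \ne j$, then

$$\mu\left(\bigcup_1^n A_k\right) = \sum_1^n \mu(A_k).$$
Whether or not the sets $A_1, \ldots, A_n \in \mathcal{F}$ are disjoint,

$$\mu\left(\bigcup_1^n A_k\right) \le \sum_1^n \mu(A_k).$$

Definition A.6. A $\sigma$-additive measure (or just measure) on a $\sigma$-field $\mathcal{F}$ is a
real-valued ($+\infty$ included), nonnegative function with domain $\mathcal{F}$ such that for
$A_1, A_2, \ldots \in \mathcal{F}$, $A_i \cap A_j = \emptyset$, $i \ne j$,

$$\mu\left(\bigcup_1^\infty A_k\right) = \sum_1^\infty \mu(A_k).$$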


We want some finiteness: 


Definition A.7. A measure (finitely or $\sigma$-additive) on a field $\mathcal{F}_0$ is $\sigma$-finite if
there are sets $A_k \in \mathcal{F}_0$ such that $\bigcup_k A_k = \Omega$ and for every $k$, $\mu(A_k) < \infty$.

We restrict ourselves henceforth to $\sigma$-finiteness! The extension problem for
measures is: Given a finitely additive measure $\mu_0$ on a field $\mathcal{F}_0$, when does
there exist a measure $\mu$ on $\mathcal{F}(\mathcal{F}_0)$ agreeing with $\mu_0$ on $\mathcal{F}_0$? A measure has
certain continuity properties:


Proposition A.8. Let $\mu$ be a measure on the $\sigma$-field $\mathcal{F}$. If $A_n \downarrow A$, $A_n \in \mathcal{F}$,
and if $\mu(A_n) < \infty$ for some $n$, then

$$\lim_n \mu(A_n) = \mu(A).$$
Also, if $A_n \uparrow A$, $A_n \in \mathcal{F}$, then
$$\lim_n \mu(A_n) = \mu(A).$$

This is called continuity from above and below. Certainly, if $\mu_0$ is to be
extended, then the minimum requirement needed is that $\mu_0$ be continuous on
its domain. Call $\mu_0$ continuous from above at $\emptyset$ if whenever $A_n \in \mathcal{F}_0$, $A_n \downarrow \emptyset$,
and $\mu_0(A_n) < \infty$ for some $n$, then

$$\lim_n \mu_0(A_n) = 0.$$
Carathéodory Extension Theorem A.9. If $\mu_0$ on $\mathcal{F}_0$ is continuous from above
at $\emptyset$, then there is a unique measure $\mu$ on $\mathcal{F}(\mathcal{F}_0)$ agreeing with $\mu_0$ on $\mathcal{F}_0$ (see
Halmos, p. 54).

Definition A.10. A measure space is a triple $(\Omega, \mathcal{F}, \mu)$ where $\mathcal{F}$, $\mu$ are a
$\sigma$-field and measure. The completion of a measure space, denoted by $(\Omega, \bar{\mathcal{F}}, \bar\mu)$,
is gotten by defining $A \in \bar{\mathcal{F}}$ if there are sets $A_1$, $A_2$ in $\mathcal{F}$, $A_1 \subset A \subset A_2$ and
$\mu(A_2 - A_1) = 0$. Then define $\bar\mu(A) = \mu(A_1)$.

$\bar{\mathcal{F}}$ is the largest $\sigma$-field for which unique extension under the hypothesis of
A.9 holds. That is, $\bar\mu$ is the only measure on $\bar{\mathcal{F}}$ agreeing with $\mu_0$ on $\mathcal{F}_0$ and

Proposition A.11. Let $B \notin \bar{\mathcal{F}}$, $\mathcal{F}_1$ the smallest $\sigma$-field containing both $B$ and $\bar{\mathcal{F}}$.
Then there is an infinity of measures on $\mathcal{F}_1$ agreeing with $\mu_0$ on $\mathcal{F}_0$. (See
Halmos, p. 55 and p. 71, Problem 3).

Note that $\mathcal{F}(\mathcal{F}_0)$ depends only on $\mathcal{F}_0$ and not on $\mu_0$, but that $\bar{\mathcal{F}}(\mathcal{F}_0)$
depends on $\mu_0$. The measure $\mu$ on $\mathcal{F}(\mathcal{F}_0)$, being a unique extension, must be
approximable in some sense by $\mu_0$ on $\mathcal{F}_0$. One consequence of the extension
construction is

Proposition A.12 (Halmos, p. 56). For every $A \in \mathcal{F}(\mathcal{F}_0)$, and $\varepsilon > 0$, there is a
set $A_0 \in \mathcal{F}_0$ such that
$$\mu(A \,\Delta\, A_0) \le \varepsilon.$$


We will designate a space $\Omega$ and a $\sigma$-field $\mathcal{F}$ of subsets of $\Omega$ as a measurable
space $(\Omega, \mathcal{F})$. If $F \subset \Omega$, denote by $\mathcal{F}(F)$ the $\sigma$-field of subsets of $F$ of the
form $A \cap F$, $A \in \mathcal{F}$, and take the complement relative to $F$.

Some important measurable spaces are

$R^{(1)}$          the real line

$\mathcal{B}_1$          the smallest $\sigma$-field containing all intervals

$R^{(k)}$          $k$-dimensional Euclidean space

$\mathcal{B}_k$          the smallest $\sigma$-field containing all $k$-dimensional rectangles

$R^{(\infty)}$     the space of all infinite sequences $(x_1, x_2, \ldots)$ of real numbers

$\mathcal{B}_\infty$     the smallest $\sigma$-field containing all sets of the form $\{(x_1, x_2, \ldots);$
             $x_1 \in I_1, \ldots, x_n \in I_n\}$ for any $n$ where $I_1, \ldots, I_n$ are any
             intervals

$R^I$          the space of all real-valued functions $x(t)$ on the interval
             $I \subset R^{(1)}$

$\mathcal{B}^I$          the smallest $\sigma$-field containing all sets of the form $\{x(\cdot) \in R^I;$
             $x(t_1) \in I_1, \ldots, x(t_n) \in I_n\}$ for any $t_1, \ldots, t_n \in I$ and intervals
             $I_1, \ldots, I_n$
Definition A.13. The class of all finite unions of disjoint intervals in $R^{(1)}$
is a field. Take $\mu_0$ on this field to be length. Then $\mu_0$ is continuous from
above at $\emptyset$ (Halmos, pp. 34 ff.). The extension of length to $\mathcal{B}_1$ is Lebesgue
measure, denoted by $l$ or by $dx$.

Henceforth, if we have a measure space $(\Omega, \mathcal{F}, \mu)$ and a statement holds
for all $\omega \in \Omega$ with the possible exception of $\omega \in A$, where $\mu(A) = 0$, we say
that the statement holds almost everywhere (a.e.).


2. MEASURABLE MAPPINGS AND FUNCTIONS 


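Definition A.14. Given two spaces $\Omega$, $R$, and a mapping $X(\omega)\colon \Omega \to R$, the
inverse image of a set $B \subset R$ is defined as

$$X^{-1}(B) = \{\omega \in \Omega;\ X(\omega) \in B\}.$$

Denote this by $\{X \in B\}$. The taking of inverse images preserves all set operations;
that is,

$$\Big\{X \in \bigcup_\alpha B_\alpha\Big\} = \bigcup_\alpha \{X \in B_\alpha\},$$

$$\Big\{X \in \bigcap_\alpha B_\alpha\Big\} = \bigcap_\alpha \{X \in B_\alpha\},$$

$$\{X \in B^c\} = \{X \in B\}^c.$$

Definition A.15. Given two measurable spaces $(\Omega, \mathcal{F})$, $(R, \mathcal{B})$. A mapping
$X\colon \Omega \to R$ is called measurable if the inverse of every set in $\mathcal{B}$ is in $\mathcal{F}$.

Proposition A.16 (See 2.29). Let $\mathcal{C} \subset \mathcal{B}$ such that $\mathcal{F}(\mathcal{C}) = \mathcal{B}$. Then
$X\colon \Omega \to R$ is measurable if the inverse of every set in $\mathcal{C}$ is in $\mathcal{F}$.

Definition A.17. $X\colon \Omega \to R^{(1)}$ will be called a measurable function if it is a
measurable map from $(\Omega, \mathcal{F})$ to $(R^{(1)}, \mathcal{B}_1)$.

From A.16 it is sufficient that $\{X < x\} \in \mathcal{F}$ for all $x$ in a set dense in
$R^{(1)}$. Whether or not a function is measurable depends on both $\Omega$ and $\mathcal{F}$.
Refer therefore to measurable functions on $(\Omega, \mathcal{F})$ as $\mathcal{F}$-measurable functions.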
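
The statement in A.14 that inverse images preserve set operations can be verified exhaustively on a small example. The sketch below uses a hypothetical finite $\Omega$ and the map $X(\omega) = \omega^2$, chosen because it is not one-to-one; the helper name is illustrative.

```python
def inverse_image(X, omega, B):
    """{w in omega : X(w) in B}."""
    return {w for w in omega if X(w) in B}

omega = set(range(-3, 4))          # a small sample space
X = lambda w: w * w                # a map that is not one-to-one
R = {X(w) for w in omega}          # the range space {0, 1, 4, 9}
B1, B2 = {0, 1}, {1, 4}

union_ok = inverse_image(X, omega, B1 | B2) == (
    inverse_image(X, omega, B1) | inverse_image(X, omega, B2))
intersection_ok = inverse_image(X, omega, B1 & B2) == (
    inverse_image(X, omega, B1) & inverse_image(X, omega, B2))
complement_ok = inverse_image(X, omega, R - B1) == (
    omega - inverse_image(X, omega, B1))
```

The same identities fail in general for forward images, which is one reason measurability is defined through inverse images.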


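Proposition A.18. The class of $\mathcal{F}$-measurable functions is closed under pointwise
convergence. That is, if $X_n(\omega)$ are each $\mathcal{F}$-measurable, and $\lim_n X_n(\omega)$
exists for every $\omega$, then $X(\omega) = \lim_n X_n(\omega)$ is $\mathcal{F}$-measurable.

Proof. Suppose $X_n(\omega) \downarrow X(\omega)$ for each $\omega$; then $\{X < x\} = \bigcup_n \{X_n < x\}$.
This latter set is in $\mathcal{F}$. In general, if $X_n(\omega) \to X(\omega)$, take

$$Y_n(\omega) = \sup_{m \ge n} X_m(\omega).$$

Then $\{Y_n > y\} = \bigcup_{m \ge n} \{X_m > y\}$ which is in $\mathcal{F}$. Then $Y_n$ is $\mathcal{F}$-measurable
and $Y_n \downarrow X$.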
Define an extended measurable function as a function $X(\omega)$ which takes
values in the extended real line $R^{(1)} \cup \{\infty\}$ such that $\{X \in B\} \in \mathcal{F}$ for every
$B \in \mathcal{B}_1$.

By the argument above, if $X_n$ are $\mathcal{F}$-measurable, then $\overline{\lim}\, X_n$, $\underline{\lim}\, X_n$
are extended $\mathcal{F}$-measurable, hence the set

$$\{\lim X_n \text{ exists}\} = \{\overline{\lim}\, X_n = \underline{\lim}\, X_n\} \cap \{|\overline{\lim}\, X_n| < \infty\}$$
is in $\mathcal{F}$.

Proposition A.19 (See 2.31). If $X$ is a measurable mapping from $(\Omega, \mathcal{F})$ to
$(R, \mathcal{B})$ and $\varphi$ is a $\mathcal{B}$-measurable function, then $\varphi(X)$ is an $\mathcal{F}$-measurable
function.


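The set indicator of a subset $A \subset \Omega$ is the function
$$\chi_A(\omega) = \begin{cases} 1, & \omega \in A, \\ 0, & \omega \in A^c. \end{cases}$$

A simple function is any finite linear combination of set indicators,

$$\varphi(\omega) = \sum_k \alpha_k \chi_{A_k}(\omega)$$
of sets $A_k \in \mathcal{F}$.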


Proposition A.20. The class of $\mathcal{F}$-measurable functions is the smallest class of
functions containing all simple functions and closed under pointwise convergence.

Proof. For any $n > 0$ and $X(\omega)$ a measurable function, define sets $A_k =
\{X \in [k/n, (k + 1)/n)\}$ and consider

$$X_n = \sum_{k=-n^2}^{n^2} \frac{k}{n}\, \chi_{A_k}(\omega).$$

Obviously $X_n \to X$.
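
The approximation in the proof of A.20 is easy to tabulate. The sketch below assumes the sum runs over $k$ from $-n^2$ to $n^2$, a truncation adopted here so that $X_n$ is a finite sum and hence simple; the function name is hypothetical. Given the value $X(\omega)$ it returns $X_n(\omega)$: the left endpoint $k/n$ of the cell $[k/n, (k+1)/n)$ containing $X(\omega)$ when $|k| \le n^2$, and 0 otherwise.

```python
import math

def simple_approx(x_value, n):
    """X_n at a point where X takes the value x_value."""
    k = math.floor(x_value * n)      # index of the cell [k/n, (k+1)/n) containing x_value
    if -n * n <= k <= n * n:
        return k / n
    return 0.0                       # outside the truncation range no indicator is 1
```

Whenever $|X(\omega)| < n$ the error satisfies $0 \le X(\omega) - X_n(\omega) < 1/n$, which gives the pointwise convergence.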


For any measurable mapping $X$ from $(\Omega, \mathcal{F})$ to $(R, \mathcal{B})$, denote by
$\mathcal{F}(X)$ the $\sigma$-field of inverse images of sets in $\mathcal{B}$. Now we prove 4.9.


Proposition A.21. If $Z$ is an $\mathcal{F}(X)$-measurable function, then there is a
$\mathcal{B}$-measurable function $\theta$ such that $Z = \theta(X)$.

Proof. Consider the class of functions $\varphi(X)$, $X$ fixed, as $\varphi$ ranges over the
$\mathcal{B}$-measurable functions. Any set indicator $\chi_A(\omega)$, $A \in \mathcal{F}(X)$, is in this class,
because $A = \{X \in B\}$ for some $B \in \mathcal{B}$. Hence

$$\chi_A(\omega) = \chi_B(X(\omega)).$$

Now the class is closed under addition, so by A.20 it is sufficient to show it
closed under pointwise convergence. Let $\varphi_n(X) \to Y$, $\varphi_n$ $\mathcal{B}$-measurable.
Let $B = \{\lim_n \varphi_n \text{ exists}\}$. Then $B \in \mathcal{B}$, and $\Omega = \{X \in B\}$. Define

$$\varphi = \begin{cases} \lim_n \varphi_n & \text{on } B, \\ 0 & \text{on } B^c. \end{cases}$$
Obviously, then $Y = \varphi(X)$.

We modify the proof of A.20 slightly to get 2.38;
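Proposition A.22. Consider a class $\mathcal{L}$ of $\mathcal{F}$-measurable functions having the
properties

i) $X, Y \in \mathcal{L}$, $\alpha, \beta \ge 0 \Rightarrow \alpha X + \beta Y \in \mathcal{L}$,
ii) $X_n \in \mathcal{L}$, $X_n \uparrow X \Rightarrow X \in \mathcal{L}$,
iii) for every $A \in \mathcal{F}$, $\chi_A(\omega) \in \mathcal{L}$;

then $\mathcal{L}$ includes all nonnegative $\mathcal{F}$-measurable functions.

Proof. For $X \ge 0$, $\mathcal{F}$-measurable, let $X_n = \sum_{k=0}^{n^2} (k/n)\chi_{A_k}(\omega)$ where $A_k =
\{X \in [k/n, (k + 1)/n)\}$. Then certainly $X_n \in \mathcal{L}$, and $X_{n'} \uparrow X$ if we take $n'$ the
subsequence $\{2^m\}$.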


3. THE INTEGRAL 


Take $(\Omega, \mathcal{F}, \mu)$ to be a measure space. Let $X(\omega) \ge 0$ be a nonnegative
$\mathcal{F}$-measurable function. To define the integral of $X$ let $X_n \ge 0$ be simple
functions such that $X_n \uparrow X$.

Definition A.23. The integral $\int X_n\, d\mu$ of the nonnegative simple function
$X_n = \sum \alpha_k \chi_{A_k}(\omega)$, $\alpha_k \ge 0$, is defined by $\sum \alpha_k \mu(A_k)$.

For $X_n \uparrow X$, it is easy to show that $\int X_{n+1}\, d\mu \ge \int X_n\, d\mu \ge 0$.

Definition A.24. Define $\int X\, d\mu$ as $\lim_n \int X_n\, d\mu$. Furthermore, the value of this
limit is the same for all sequences of nonnegative simple functions converging
up to $X$ (Halmos, p. 101).

Note that this limit may be infinite. For any $\mathcal{F}$-measurable function $X$,
suppose that $\int |X|\, d\mu < \infty$. In this case define $\mathcal{F}$-measurable functions

$$X^+(\omega) = \begin{cases} X(\omega), & X(\omega) \ge 0, \\ 0, & X(\omega) < 0, \end{cases}$$

$$X^-(\omega) = \begin{cases} -X(\omega), & X(\omega) \le 0, \\ 0, & X(\omega) > 0. \end{cases}$$

Note $|X| = X^+ + X^-$.




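Definition A.25. If $\int |X|\, d\mu < \infty$, define

$$\int X\, d\mu = \int X^+\, d\mu - \int X^-\, d\mu;$$
we may sometimes use the notation
$$\int X(\omega)\, \mu(d\omega).$$

The elementary properties of the integral are: If the integrals of $X$ and $Y$
exist,

i) $X \ge Y \Rightarrow \int X\, d\mu \ge \int Y\, d\mu$,

ii) $\int (\alpha X + \beta Y)\, d\mu = \alpha \int X\, d\mu + \beta \int Y\, d\mu$,

iii) $A, B \in \mathcal{F}$, $A \cap B = \emptyset \Rightarrow \int_{A \cup B} X\, d\mu = \int_A X\, d\mu + \int_B X\, d\mu$.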


Some nonelementary properties begin with 


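Monotone Convergence Theorem A.26 (Halmos, p. 112). For $X_n \ge 0$ nonnegative
$\mathcal{F}$-measurable functions, $X_n \uparrow X$, then

$$\lim_n \int X_n\, d\mu = \int X\, d\mu.$$
From this comes the

Fatou Lemma A.27. If $X_n \ge 0$, then

$$\int \underline{\lim}\, X_n\, d\mu \le \underline{\lim} \int X_n\, d\mu.$$

Proof. To connect up with A.26 note that for $X_1, \ldots, X_n$ arbitrary, nonnegative
$\mathcal{F}$-measurable functions,

$$\int \inf(X_1, \ldots, X_n)\, d\mu \le \int X_k\, d\mu, \qquad k = 1, \ldots, n.$$
Hence, by taking limits
$$\int \inf_n X_n\, d\mu \le \inf_n \int X_n\, d\mu.$$

Let $Y_n = \inf_{m \ge n} X_m$; then

$$\int Y_n\, d\mu \le \inf_{m \ge n} \int X_m\, d\mu.$$

Since $Y_n \ge 0$, and $Y_n \uparrow \underline{\lim}\, X_n$, apply A.26 to complete the proof.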
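
The inequality in the Fatou lemma can be strict, and a two-point example shows it. The sketch below assumes counting measure on $\{0, 1\}$ and takes $X_n$ to be the indicator of the single point $n \bmod 2$; these choices, and the finite window of indices standing in for "all large $n$", are illustrative assumptions (the window is harmless here because the sequence has period 2).

```python
def integral(f, mu):
    """Integral of f against a measure mu given as {point: mass} on a finite space."""
    return sum(f(w) * mass for w, mass in mu.items())

def X(n):
    """X_n is the indicator of the single point n mod 2."""
    return lambda w: 1.0 if w == n % 2 else 0.0

mu = {0: 1.0, 1: 1.0}      # counting measure on the two-point space {0, 1}
tail = range(10, 20)       # finite stand-in for the tail of the sequence

lhs = integral(lambda w: min(X(n)(w) for n in tail), mu)   # integral of lim inf X_n
rhs = min(integral(X(n), mu) for n in tail)                # lim inf of the integrals
```

Here every $X_n$ has integral 1, while $\underline{\lim}\, X_n$ vanishes identically, so the left side is 0 and the right side is 1.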
Another useful convergence result is:

Bounded Convergence Theorem A.28 (2.44). Let $X_n \to X$ pointwise, where
the $X_n$ are $\mathcal{F}$-measurable functions such that there is an $\mathcal{F}$-measurable function
$Z$ with $|X_n| \le Z$, all $n$, $\omega$, and $\int Z\, d\mu < \infty$. Then (see Halmos, p. 110)

$$\lim_n \int X_n\, d\mu = \int X\, d\mu.$$

From these convergence theorems can be deduced the $\sigma$-additivity of an
integral: For $\{B_n\}$ disjoint, $B_n \in \mathcal{F}$, and $\int |X|\, d\mu < \infty$,

$$\int_{\bigcup B_n} X\, d\mu = \sum_n \int_{B_n} X\, d\mu.$$

Also, if $\int |X|\, d\mu < \infty$, then the integral is absolutely continuous. That is,
for every $\varepsilon > 0$, there exists a $\delta > 0$ such that if $A \in \mathcal{F}$ and $\mu(A) < \delta$, then

$$\left|\int_A X\, d\mu\right| < \varepsilon.$$


4. ABSOLUTE CONTINUITY AND 
THE RADON-NIKODYM THEOREM 


Consider a measurable space $(\Omega, \mathcal{F})$ and two measures $\mu$, $\nu$ on $\mathcal{F}$.

Definition A.29. Say that $\nu$ is absolutely continuous with respect to $\mu$, denoted
$\nu \ll \mu$, if $A \in \mathcal{F}$, $\mu(A) = 0 \Rightarrow \nu(A) = 0$.

Call two measurable functions $X_1$, $X_2$ equivalent if $\mu(\{X_1 \ne X_2\}) = 0$. Then

Radon-Nikodym Theorem A.30 (Halmos, p. 128). If $\nu \ll \mu$, then there exists
a nonnegative $\mathcal{F}$-measurable function $X$ determined up to equivalence, such
that for any $A \in \mathcal{F}$,

$$\nu(A) = \int_A X\, d\mu.$$

Another way of denoting this is to say that the Radon derivative of $\nu$ with
respect to $\mu$ exists and equals $X$; that is,

$$\frac{d\nu}{d\mu} = X.$$

The opposite of continuity is

Definition A.31. Say that $\nu$ is singular with respect to $\mu$, written $\mu \perp \nu$, if
there exists $A \in \mathcal{F}$ such that

$$\mu(A) = 0, \qquad \nu(A^c) = 0.$$
Lebesgue Decomposition Theorem A.32 (Halmos, p. 134). For any two
measures $\mu$, $\nu$ on $\mathcal{F}$, $\nu$ can be decomposed into two measures, $\nu_c$, $\nu_s$, in the sense
that for every $A \in \mathcal{F}$,
$$\nu(A) = \nu_c(A) + \nu_s(A)$$
and $\nu_c \ll \mu$, $\nu_s \perp \mu$.

For a $\sigma$-finite measure the set of points $\{\omega;\ \mu(\{\omega\}) > 0\}$ is at most
countable. Call $\nu$ a point measure if there is a countable set $G = \{\omega_j\}$ such that
for every $A \in \mathcal{F}$,
$$\nu(A) = \nu(A \cap G).$$

Obviously, any measure $\nu$ may be decomposed into $\nu_1 + \nu_p$, where $\nu_p$ is a
point measure and $\nu_1$ assigns mass zero to any one-point set. Hence, on
$(R^{(1)}, \mathcal{B}_1)$ we have the special case of A.32.

Corollary A.33. A measure $\nu$ on $\mathcal{B}_1$ can be written as
$$\nu = \nu_p + \nu_s + \nu_c,$$

where $\nu_p$ is a point measure, $\nu_s \perp l$ but $\nu_s$ assigns zero mass to any one-point
sets, and $\nu_c \ll l$. [Recall $l$ is Lebesgue measure.]


5. CROSS-PRODUCT SPACES AND THE FUBINI THEOREM 


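Definition A.34. Given two spaces $\Omega_1$, $\Omega_2$, their cross product $\Omega_1 \times \Omega_2$ is the
set of all ordered pairs $\{(\omega_1, \omega_2);\ \omega_1 \in \Omega_1, \omega_2 \in \Omega_2\}$. For measurable spaces
$(\Omega_1, \mathcal{F}_1)$, $(\Omega_2, \mathcal{F}_2)$, $\mathcal{F}_1 \times \mathcal{F}_2$ is the smallest $\sigma$-field containing all sets of the
form

$$\{(\omega_1, \omega_2);\ \omega_1 \in A_1, \omega_2 \in A_2\},$$

where $A_1 \in \mathcal{F}_1$, $A_2 \in \mathcal{F}_2$. Denote this set by $A_1 \times A_2$.

For a function $X(\omega_1, \omega_2)$ on $\Omega_1 \times \Omega_2$, its section at $\omega_1$ is the function on $\Omega_2$
gotten by holding $\omega_1$ constant and letting $\omega_2$ be the variable. Similarly, if
$A \subset \Omega_1 \times \Omega_2$, its section at $\omega_1$ is defined as $\{\omega_2;\ (\omega_1, \omega_2) \in A\}$.

Theorem A.35 (Halmos, pp. 141 ff.). Let $X$ be an $\mathcal{F}_1 \times \mathcal{F}_2$-measurable function;
then every section of $X$ is an $\mathcal{F}_2$-measurable function. If $A \in \mathcal{F}_1 \times \mathcal{F}_2$, every
section of $A$ is in $\mathcal{F}_2$.

If we have measures $\mu_1$ on $\mathcal{F}_1$, $\mu_2$ on $\mathcal{F}_2$, then

Theorem A.36 (Halmos, p. 144). There is a unique measure $\mu_1 \times \mu_2$ on
$\mathcal{F}_1 \times \mathcal{F}_2$ such that for every $A_1 \in \mathcal{F}_1$, $A_2 \in \mathcal{F}_2$,

$$\mu_1 \times \mu_2(A_1 \times A_2) = \mu_1(A_1)\,\mu_2(A_2).$$

This is called the cross-product measure.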
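Fubini Theorem A.37 (Halmos, p. 148). Let $X$ be $\mathcal{F}_1 \times \mathcal{F}_2$-measurable, and
$$\int |X|\, d(\mu_1 \times \mu_2) < \infty.$$

Then
$$\int X(\omega_1, \omega_2)\, \mu_1(d\omega_1), \qquad \int X(\omega_1, \omega_2)\, \mu_2(d\omega_2)$$

are respectively $\mathcal{F}_2$- and $\mathcal{F}_1$-measurable functions, which may be infinite on sets
of measure zero, but whose integrals exist. And

$$\int X\, d(\mu_1 \times \mu_2) = \int \left(\int X(\omega_1, \omega_2)\, \mu_1(d\omega_1)\right) \mu_2(d\omega_2)$$
$$= \int \left(\int X(\omega_1, \omega_2)\, \mu_2(d\omega_2)\right) \mu_1(d\omega_1).$$

Corollary A.38. If $A \in \mathcal{F}_1 \times \mathcal{F}_2$ and $\mu_1 \times \mu_2(A) = 0$, then almost every
section of $A$ has $\mu_2$ measure zero.

This all has fairly obvious extensions to finite cross products $\Omega_1 \times \cdots \times \Omega_n$.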
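
On finite spaces the Fubini theorem reduces to interchanging two finite sums, which can be checked exactly. The sketch below represents each measure as a dictionary from points to masses and uses exact rational arithmetic; the particular spaces, masses and integrand are hypothetical examples.

```python
from fractions import Fraction

def iterated_and_product(X, mu1, mu2):
    """Both iterated integrals and the product-measure integral of X on a finite product space."""
    inner_first_1 = sum(mu2[w2] * sum(mu1[w1] * X(w1, w2) for w1 in mu1) for w2 in mu2)
    inner_first_2 = sum(mu1[w1] * sum(mu2[w2] * X(w1, w2) for w2 in mu2) for w1 in mu1)
    product = sum(mu1[w1] * mu2[w2] * X(w1, w2) for w1 in mu1 for w2 in mu2)
    return inner_first_1, inner_first_2, product

mu1 = {0: Fraction(1, 2), 1: Fraction(1, 2)}
mu2 = {"a": Fraction(1, 3), "b": Fraction(2, 3)}
X = lambda w1, w2: w1 + (2 if w2 == "a" else 5)

results = iterated_and_product(X, mu1, mu2)
```

All three numbers coincide, as the theorem asserts; here each equals $\frac{1}{2} + 4 = \frac{9}{2}$.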


6. THE $L_r(\mu)$ SPACES


These are some well-known results. Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and the
functions $X$, $Y$ be $\mathcal{F}$-measurable:

Schwarz Inequality A.39. If $\int |X|^2\, d\mu$ and $\int |Y|^2\, d\mu$ are finite, then so is
$\int |XY|\, d\mu$ and

$$\left(\int |XY|\, d\mu\right)^2 \le \left(\int |X|^2\, d\mu\right)\left(\int |Y|^2\, d\mu\right).$$

Definition A.40. $L_r(\mu)$, $r > 0$, is the class of all $\mathcal{F}$-measurable functions $X$
such that $\int |X|^r\, d\mu < \infty$.

Completeness Theorem A.41 (Halmos, p. 107). If $X_n \in L_r(\mu)$ and
$$\int |X_n - X_m|^r\, d\mu \to 0$$
as $m, n \to \infty$ in any way, then there is a function $X \in L_r(\mu)$ such that

$$\int |X_n - X|^r\, d\mu \to 0.$$


7. TOPOLOGICAL MEASURE SPACES 


In this section, unless otherwise stated, assume that $\Omega$ has a metric under
which it is a separable metric space. Let $\mathcal{C}$ be the class of all open sets in $\Omega$.
A measure $\mu$ on $\mathcal{F}$ is called inner regular if for any $A \in \mathcal{F}$

$$\mu(A) = \sup \mu(C),$$
where the sup is over all compact sets $C \subset A$, $C \in \mathcal{F}$. It is called outer
regular if
$$\mu(A) = \inf \mu(O),$$

where the inf is over all open sets $O$ such that $A \subset O$, and $O \in \mathcal{F}$.

Theorem A.42 (Follows from Halmos, p. 228). Any measure on $\mathcal{F}(\mathcal{C})$ is both
inner and outer regular.

Theorem A.43 (Follows from Halmos, p. 240). The class of $\mathcal{F}(\mathcal{C})$-measurable
functions is the smallest class of functions containing all continuous functions on
$\Omega$ and closed under pointwise convergence.

Theorem A.44 (Halmos, p. 241). If $\int |X|\, d\mu < \infty$ for $X$ $\mathcal{F}(\mathcal{C})$-measurable,
then for any $\varepsilon > 0$ there is a continuous function $\varphi$ on $\Omega$ such that

$$\int |X - \varphi|\, d\mu \le \varepsilon.$$

Definition A.45. Given any measurable space $(\Omega, \mathcal{F})$ ($\Omega$ not necessarily
metric), it is called a Borel space if there is a 1-1 mapping $\varphi\colon \Omega \leftrightarrow E$ where
$E \in \mathcal{B}_1$ such that $\varphi$ is $\mathcal{F}$-measurable, and $\varphi^{-1}$ is $\mathcal{B}_1(E)$-measurable.


Theorem A.46. If $\Omega$ is complete, then $(\Omega, \mathcal{F}(\mathcal{C}))$ is a Borel space.

We prove this in the case that $(\Omega, \mathcal{F}(\mathcal{C}))$ is $(R^{(\infty)}, \mathcal{B}_\infty)$. Actually, since there
is a 1-1 continuous mapping between $R^{(1)}$ and $(0, 1)$ it is sufficient to show that

Theorem A.47. $((0, 1)^{(\infty)}, \mathcal{B}_\infty(0, 1))$ is a Borel space.

Note. Here $(0, 1)^{(\infty)}$ is the set of all infinite sequences with coordinates in
$(0, 1)$ and $\mathcal{B}_\infty(0, 1)$ means $\mathcal{B}_\infty((0, 1)^{(\infty)})$.

Proof. First we construct the mapping $\Phi$ from $(0, 1)$ to $(0, 1)^{(\infty)}$. Every
number in $(0, 1)$ has a unique binary expansion $x = .x_1 x_2 \cdots$ containing an
infinite number of zeros. Consider the triangular array

    1   3   6   10  ...
    2   5   9   ...
    4   8   ...
    7   ...

Let
$$\Phi(x) = (\Phi_1(x), \Phi_2(x), \ldots),$$
where the $n$th coordinate is formed by going down the $n$th column of the
array; that is,
$$\Phi_1(x) = .x_1 x_2 x_4 x_7 \cdots,$$

$$\Phi_2(x) = .x_3 x_5 x_8 \cdots,$$
and so on. Conversely, if $x \in (0, 1)^{(\infty)}$, $x = (x^{(1)}, x^{(2)}, \ldots)$, expand every
coordinate as the unique binary decimal having an infinite number of zeros,
say

$$x^{(k)} = .x_1^{(k)} x_2^{(k)} \cdots,$$

and define $\varphi(x)$ to be the binary decimal whose $n$th entry is $x_j^{(k)}$ if $n$ appears
in the $k$th column $j$ numbers down. That is,

$$\varphi(x) = .x_1^{(1)} x_2^{(1)} x_1^{(2)} x_3^{(1)} \cdots.$$
Clearly, $\Phi$ and $\varphi$ are inverses of each other, so the mapping is 1-1 and onto.
By 2.13, to show $\Phi$ $\mathcal{B}_1(0, 1)$-measurable, it is sufficient to show that each
$\Phi_k(x)$ is $\mathcal{B}_1(0, 1)$-measurable. Notice that the coordinates $x_1(x), x_2(x), \ldots$
in the decimal expansion of $x$ are measurable functions of $x$, continuous
except at the points which have only a finite number of ones in their expansion
(binary rationals). Furthermore, each $\Phi_k(x)$ is a sum of these, for example,
$$\Phi_1(x) = \frac{x_1(x)}{2} + \frac{x_2(x)}{2^2} + \frac{x_4(x)}{2^3} + \cdots.$$

Therefore, every $\Phi_k(x)$ is measurable. The proof that $\varphi(x)$ is measurable
similarly proceeds from the observation that $x_j^{(k)}(x)$ is a measurable function
of $x$. Q.E.D.
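
The digit bookkeeping in this proof can be tried out on finite prefixes of a binary expansion. The sketch below assumes the triangular array is filled along anti-diagonals, so that its first column reads 1, 2, 4, 7, ... and its second 3, 5, 8, ..., which is consistent with the formulas given for $\Phi_1$ and $\Phi_2$. The function names are hypothetical, and working with prefixes is only an illustration of the infinite construction.

```python
def position(n):
    """Return (column k, depth j) of the index n in the triangular array."""
    d = 1
    while d * (d + 1) // 2 < n:          # find the anti-diagonal containing n
        d += 1
    i = n - (d - 1) * d // 2             # offset of n within that anti-diagonal
    return i, d - i + 1

def split_digits(digits):
    """Phi on a finite prefix: send the digits x_1, x_2, ... to their columns."""
    columns = {}
    for n, bit in enumerate(digits, start=1):
        k, _ = position(n)
        columns.setdefault(k, []).append(bit)
    return columns

def merge_digits(columns, length):
    """The inverse map on a finite prefix: the nth digit is entry j of column k."""
    out = []
    for n in range(1, length + 1):
        k, j = position(n)
        out.append(columns[k][j - 1])
    return out
```

Splitting a prefix and merging it back returns the original digits, mirroring the statement that $\Phi$ and $\varphi$ are inverses of each other.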

To prove A.46 generally, see Sierpinski [123a, p. 137], where it is proved
that every complete separable metric space is homeomorphic to a subset of
$R^{(\infty)}$. (See also p. 206.)


8. EXTENSION ON SEQUENCE SPACE 


A.48 (Proof of Theorem 2.18). Consider the class $\mathcal{F}_0$ of all finite disjoint 
unions of finite dimensional rectangles. It is easily verified that $\mathcal{F}_0$ is a field, 
and that 

$$\mathcal{B}_\infty = \mathcal{F}(\mathcal{F}_0).$$


For any set $A \in \mathcal{F}_0$, $A = \bigcup_j S_j$, where the $S_j$ are disjoint rectangles, define 
$$\hat{P}(A) = \sum_j \hat{P}(S_j).$$

There is a uniqueness problem: Suppose A can also be represented as $\bigcup_k S'_k$, 
where the $S'_k$ are also disjoint. We need to know that 

$$\sum_j \hat{P}(S_j) = \sum_k \hat{P}(S'_k).$$
Write 
$$S'_k = S'_k \cap A = \bigcup_j (S'_k \cap S_j).$$
By part (b) of the hypothesis, 
$$\hat{P}(S'_k) = \sum_j \hat{P}(S'_k \cap S_j),$$
so 

$$\sum_k \hat{P}(S'_k) = \sum_{k,j} \hat{P}(S'_k \cap S_j).$$
By a symmetric argument 
$$\sum_j \hat{P}(S_j) = \sum_{k,j} \hat{P}(S'_k \cap S_j).$$
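
The common-refinement argument above can be checked numerically on a toy case. The following Python sketch is an illustration under stated assumptions, not the general setting of the theorem: the "rectangles" are half-open rectangles in the unit square, and ordinary area stands in for the set function on rectangles. The helper names `measure` and `intersect` are hypothetical.

```python
from fractions import Fraction as F
from itertools import product


def measure(rect):
    """Area of a half-open rectangle ((a1, b1), (a2, b2)); assumed stand-in for the set function."""
    (a1, b1), (a2, b2) = rect
    return max(b1 - a1, 0) * max(b2 - a2, 0)


def intersect(r, s):
    """Intersection of two rectangles, again a (possibly empty) rectangle."""
    return tuple((max(a, c), min(b, d)) for (a, b), (c, d) in zip(r, s))


# Two different disjoint decompositions of the unit square [0, 1) x [0, 1).
S = [((F(0), F(1, 2)), (F(0), F(1))),
     ((F(1, 2), F(1)), (F(0), F(1)))]          # two vertical halves
S_prime = [((F(0), F(1)), (F(0), F(1, 3))),
           ((F(0), F(1)), (F(1, 3), F(1)))]    # two horizontal strips

sum_S = sum(measure(r) for r in S)
sum_S_prime = sum(measure(r) for r in S_prime)

# Both sums agree with the sum over the common refinement by intersections.
refinement = sum(measure(intersect(r, s)) for r, s in product(S, S_prime))
```

Exact rational arithmetic is used so that the three sums can be compared for equality without rounding concerns.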


Now we are in a position to apply the Carathéodory extension theorem, if we 
can prove that $A_n \in \mathcal{F}_0$, $A_n \downarrow \emptyset$ implies $\hat{P}(A_n) \to 0$. To do this, assume that 
$\lim_n \hat{P}(A_n) = \delta > 0$. By repeating some $A_n$ in the sequence, if necessary, we 
can assume that 

$$A_n = \{x; (x_1, \ldots, x_n) \in A_n^*\},$$

where $A_n^*$ is a union of disjoint rectangles in $R^{(n)}$. By part (c) of the hypothesis 
we can find a set $B_n^* \subset A_n^*$ so that $B_n^*$ is a finite union of compact rectangles 
in $R^{(n)}$, and if 

$$B_n = \{x; (x_1, \ldots, x_n) \in B_n^*\},$$
then 
$$\hat{P}(A_n - B_n) \le \frac{\delta}{2^{n+1}}.$$
Form the sets $C_n = \bigcap_1^n B_k$, and put 
$$C_n = \{x; (x_1, \ldots, x_n) \in C_n^*\}.$$
Then, since the $A_n$ are nonincreasing, 

$$\hat{P}(A_n - C_n) \le \sum_{k=1}^{n} \hat{P}(A_n - B_k) \le \sum_{k=1}^{n} \hat{P}(A_k - B_k) < \delta/2.$$


The conclusion is that $\lim_n \hat{P}(C_n) \ge \delta/2$, and, of course, $C_n \downarrow \emptyset$. Take points 
$$x^{(1)} \in C_1, \; x^{(2)} \in C_2, \; \ldots, \qquad x^{(n)} = (x_1^{(n)}, x_2^{(n)}, \ldots).$$
For every n, 

$$(x_1^{(n)}, \ldots, x_n^{(n)}) \in C_n^*.$$

Take $N_1$, any ordered infinite subsequence of integers, such that $x_1^{(n)} \to x_1 \in C_1^*$ 
as n runs through $N_1$. This is certainly possible since $x_1^{(n)} \in C_1^*$ for all n. 
Now take $N_2 \subset N_1$ such that 
$$(x_1^{(n)}, x_2^{(n)}) \to (x_1, x_2) \in C_2^*$$
as n runs through $N_2$. Continuing, we construct subsequences 
$$N_1 \supset N_2 \supset N_3 \supset \cdots.$$

Let $n_k$ be the kth member of $N_k$, then for every j, $x_j^{(n)} \to x_j$ as n goes to infinity 
through $\{n_k\}$. Furthermore, the point $(x_1, x_2, \ldots)$ is in $C_n$ for every $n \ge 1$, 
contradicting $C_n \downarrow \emptyset$. 
integers, returns from infinity, 
339-340 
backwards and forwards equations, 
340 
asymptotic behavior of transition 
probabilities, 344-345 
Markov times, for Brownian motion, 
268-270 
defined, 323 
Martin, R., 297 
Martingales and submartingales, 
definitions, 83-84 
optional sampling theorem, 84-89 